# Theory questions

1. What is NumPy, and why is it widely used in Python?

- NumPy (short for Numerical Python) is a powerful open-source library in Python that provides support for:

    - Large, multi-dimensional arrays and matrices

    - A vast collection of mathematical functions to operate efficiently on these arrays
- NumPy is essential for data science, machine learning, scientific research, and any field requiring heavy numerical computations in Python.
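
As a minimal sketch of the two points above (the array values are illustrative, not from any dataset), a multi-dimensional array can be built from nested lists and passed whole to a mathematical function:

```python
import numpy as np

# A 2-D array (matrix) built from nested Python lists (illustrative values)
matrix = np.array([[1.0, 4.0],
                   [9.0, 16.0]])

# Mathematical functions operate on the whole array at once
roots = np.sqrt(matrix)
total = matrix.sum()

print(matrix.shape)  # (2, 2)
print(total)         # 30.0
```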

2. How does broadcasting work in NumPy?

- Broadcasting allows NumPy to perform arithmetic operations on arrays of different shapes by automatically “stretching” the smaller array along the missing dimensions so they have compatible shapes.

- It helps avoid explicit loops and lets you write fast, clean code.

- Broadcasting rules (simplified): NumPy compares the shapes of the two arrays starting from the trailing (rightmost) dimensions and works backward. Two dimensions are compatible for broadcasting when:

    - the dimensions are equal, or

    - one of the dimensions is 1.

- If these conditions aren’t met, NumPy raises a ValueError.
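
The rules can be tried out with a short sketch. The shapes and values below are chosen only for illustration:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # shape (2, 3)
b = np.array([10, 20, 30])       # shape (3,)

# Trailing dimensions match (3 == 3), so b is stretched across both rows of a
result = a + b
print(result.shape)  # (2, 3)

# Shapes (2, 3) and (2,) are incompatible: the trailing dimensions are 3 and 2
try:
    a + np.array([1, 2])
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```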


3. What is a Pandas DataFrame?

- A Pandas DataFrame is one of the most important data structures provided by the Pandas library in Python. It’s like a table or spreadsheet in memory, designed for working with structured data.
- Key Features of a Pandas DataFrame:
    - 2-dimensional labeled data structure (rows and columns, both of which can have labels/indexes)

    - Columns can be of different data types (integers, floats, strings, etc.)

    - Supports fast, flexible data manipulation, including filtering, aggregation, joining, reshaping, and more

    - Built on top of NumPy arrays, combining efficient numerical operations with flexible data handling
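
A small sketch of these features follows. The column names, row labels, and values are hypothetical and serve only to illustrate:

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative only
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85.5, 90.0, 75.25],
    },
    index=["a", "b", "c"],  # row labels
)

print(df.dtypes)                 # each column keeps its own data type
age_of_b = df.loc["b", "age"]    # label-based access to one cell
older = df[df["age"] > 26]       # boolean filtering of rows
print(older)
```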



4. Explain the use of the groupby() method in Pandas

- The groupby() method is used to split a DataFrame into groups based on one or more keys (columns), then apply a function (like aggregation, transformation, or filtering) to each group independently, and finally combine the results.
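- For example, given a DataFrame with Department and Salary columns, groupby('Department') splits the data into one group per department, such as HR, IT, and Finance.

- Selecting ['Salary'] and calling .mean() then computes the average salary in each group.

- The result is a Series with the group names as index and the mean salary as values.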
- groupby() helps summarize or transform data based on categories.

- It’s fundamental for data analysis tasks like reporting, segmentation, and data exploration.
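
The split-apply-combine steps can be sketched as follows. The Department and Salary column names come from the description above; the rows themselves are made up for illustration:

```python
import pandas as pd

# Made-up rows for illustration
df = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "IT", "HR"],
    "Salary": [50000, 70000, 65000, 80000, 54000],
})

# Split by department, apply mean() to Salary, combine into one Series
mean_salary = df.groupby("Department")["Salary"].mean()
print(mean_salary)
```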




5. Why is Seaborn preferred for statistical visualizations?

- Seaborn is preferred for statistical visualizations for several reasons:

    - Built on top of Matplotlib: Seaborn simplifies complex plotting commands from Matplotlib and provides high-level functions for common statistical tasks.

    - Easier to use for statistical plots: it has built-in support for box plots, violin plots, swarm plots, regression plots, and heatmaps (e.g., correlation matrices). These are useful for analyzing distributions, relationships, and comparisons in data.

    - Attractive and informative defaults: Seaborn comes with aesthetic default themes and color palettes, which make plots visually appealing and easier to interpret.

    - Integrates well with Pandas: Seaborn works smoothly with Pandas DataFrames and can use column names directly, which makes it efficient for data exploration and analysis.

    - Built-in support for aggregation and statistical estimation: it can automatically compute estimates such as the mean and draw standard deviations or confidence intervals as error bars, which makes it well suited to summarizing data.

6. What are the differences between NumPy arrays and Python lists?

- NumPy arrays and Python lists are both used to store collections of data, but they have significant differences. A key distinction is that NumPy arrays are homogeneous, meaning all elements must be of the same data type, whereas Python lists are heterogeneous, allowing elements of different types. NumPy arrays are more memory-efficient and offer faster performance, as they are implemented in C and optimized for numerical computations. They support vectorized operations, enabling element-wise computations without explicit loops, which makes them ideal for scientific and mathematical tasks. In contrast, Python lists require loops or list comprehensions for such operations, which can be slower. Additionally, NumPy arrays support multi-dimensional structures, making them suitable for handling matrices or large datasets. Overall, NumPy arrays are preferred in data science and machine learning applications, while Python lists are more suited for general-purpose programming.
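
A brief sketch of the main contrasts, using illustrative values:

```python
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

# Lists need a loop or comprehension for element-wise arithmetic
doubled_list = [v * 2 for v in values]

# Arrays support vectorized arithmetic directly
doubled_arr = arr * 2

# The same operator on a list means repetition, not multiplication
repeated = values * 2

# Lists can mix types; an array coerces its elements to one dtype
mixed_list = [1, "two", 3.0]
coerced = np.array([1, 2.5])
print(coerced.dtype)  # float64
```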










7. What is a heatmap, and when should it be used?

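- A heatmap is a data visualization technique that uses color gradients to represent the magnitude or intensity of data values in a two-dimensional space. Typically, it displays a matrix-like structure where each cell’s color reflects a numerical value, making it easy to identify patterns, correlations, or anomalies. It should be used when the goal is to scan a grid of values at a glance, for example a correlation matrix between variables or a table of counts by category and time period; it is less suitable when readers need to read off exact values.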

8. What does the term “vectorized operation” mean in NumPy?

- The term “vectorized operation” in NumPy refers to performing operations on entire arrays (vectors, matrices, etc.) at once, without using explicit loops.

- Instead of iterating element by element (as in standard Python), NumPy performs operations using highly optimized C code under the hood, which makes it faster and more concise.
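
The difference can be sketched by computing the same result both ways (values are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version: one Python-level operation per element
looped = np.empty_like(x)
for i in range(len(x)):
    looped[i] = x[i] ** 2 + 1

# Vectorized version: one expression applied to the whole array
vectorized = x ** 2 + 1

print(np.array_equal(looped, vectorized))  # True
```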

9. How does Matplotlib differ from Plotly?

- Matplotlib and Plotly are both powerful Python libraries for data visualization, but they differ significantly in their approach and features. Matplotlib is primarily used for creating static, publication-quality plots and is known for its flexibility and fine-grained customization. However, it requires more code to produce complex visuals, and its interactivity is limited to what the chosen backend provides, such as basic pan and zoom. On the other hand, Plotly specializes in creating interactive plots with features like zooming, tooltips, and dynamic updates built in by default. It is ideal for web applications, dashboards, and presentations. While Matplotlib is preferred in academic and research settings for producing static images, Plotly is widely used in data analysis and business intelligence for interactive data exploration. Ultimately, the choice depends on whether you need static precision or dynamic interaction in your visualizations.










10. What is the significance of hierarchical indexing in Pandas?

- Hierarchical indexing (also called MultiIndexing) in Pandas is a powerful feature that allows you to have multiple levels of indexing on a single axis (rows or columns). This enables you to work with higher-dimensional data in a 2D DataFrame or 1D Series, making it easier to organize, group, and access complex datasets.

- Significance of hierarchical indexing:

    - Represents multi-dimensional data compactly: it allows data with multiple keys (like "State" and "City") to be stored in a clean and structured way within a 2D DataFrame.

    - Simplifies complex data analysis: you can easily perform grouping, reshaping (e.g., stack, unstack), and aggregation operations.

    - Improves data organization: it adds clarity and hierarchy to data that naturally has multiple levels, such as time-series data with Year and Month, or Company and Department.

    - Enables powerful slicing and subsetting: you can use tuple-based indexing to slice data across multiple levels.

    - Useful in pivot tables and grouped operations: a hierarchical index is created automatically by operations like groupby() or pivot_table() when they are given more than one key.
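
A short sketch of these operations is shown below. It reuses the "State" and "City" keys mentioned above; the figures are placeholders, not real population data:

```python
import pandas as pd

# Placeholder figures for illustration only (not real population data)
index = pd.MultiIndex.from_tuples(
    [("California", "Los Angeles"), ("California", "San Diego"),
     ("Texas", "Dallas"), ("Texas", "Houston")],
    names=["State", "City"],
)
population = pd.Series([3.9, 1.4, 1.3, 2.3], index=index, name="Population_M")

# Select every city in one state (outer level)
california = population.loc["California"]

# Select a single entry with a tuple, one label per level
houston = population.loc[("Texas", "Houston")]

# Reshape: move the inner level into columns
wide = population.unstack("City")

# Grouping by two keys also produces a hierarchical index
grouped = population.reset_index().groupby(["State", "City"])["Population_M"].sum()
print(grouped.index.nlevels)  # 2
```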




11. What is the role of Seaborn’s pairplot() function?


- The **pairplot()** function in Seaborn is used to create a grid of scatter plots for visualizing relationships between all pairs of numerical features in a dataset. It also shows the distribution of individual features on the diagonal using histograms or KDE plots.

- Exploratory Data Analysis (EDA): helps in quickly identifying patterns, trends, and relationships between variables.

- Visualizing correlation: you can visually assess how strongly variables are related (linear, non-linear, or no relationship).

- Multivariate visualization: it shows every pairwise combination in a single, easy-to-interpret grid.

- Class-wise analysis: when a hue parameter is specified (like a class label), it differentiates the plots by color, helping to understand how different classes are distributed.




12. What is the purpose of the describe() function in Pandas?

- The describe() function in Pandas is used to generate summary statistics of a DataFrame or Series, providing a quick overview of the distribution and key properties of the data. It typically includes metrics such as count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum values for each numerical column. This helps in understanding the central tendency, spread, and range of the data, which is essential during exploratory data analysis (EDA). Additionally, when applied to categorical data, describe() provides counts of unique values, the most frequent value (top), and its frequency. Overall, describe() is a convenient and fast way to get insights into the data’s basic characteristics before deeper analysis.
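
As a rough illustration, the sketch below calls describe() on one numerical and one text column. The column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "Age": [25, 30, 35, 40],
    "City": ["Pune", "Delhi", "Pune", "Mumbai"],
})

# Numerical column: count, mean, std, min, quartiles, max
numeric_summary = df["Age"].describe()
print(numeric_summary)

# Text column: count, unique, top (most frequent value), freq
categorical_summary = df["City"].describe()
print(categorical_summary)
```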



13. Why is handling missing data important in Pandas?

- Handling missing data in Pandas is important because missing or incomplete data can lead to inaccurate analyses, biased results, and errors in computations. Many data processing and statistical methods assume complete datasets, so ignoring missing values might cause algorithms to fail or produce misleading outputs. Properly dealing with missing data—whether by filling, interpolating, or removing it—ensures the integrity and reliability of your analysis. Additionally, identifying missing values helps you understand data quality and decide the best strategy for handling gaps based on the context of your problem. Overall, managing missing data is a crucial step for producing valid, trustworthy insights from your datasets.
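
The three strategies mentioned (removing, filling, interpolating) can be sketched on a small, made-up column named `temp`:

```python
import numpy as np
import pandas as pd

# Made-up readings with two gaps
df = pd.DataFrame({"temp": [20.0, np.nan, 24.0, np.nan, 30.0]})

# Identify missing values first
missing_count = df["temp"].isna().sum()

# Option 1: remove rows that contain missing values
dropped = df.dropna()

# Option 2: fill gaps with a fixed statistic, here the column mean
filled = df["temp"].fillna(df["temp"].mean())

# Option 3: estimate gaps from neighboring values (linear interpolation)
interpolated = df["temp"].interpolate()
print(interpolated.tolist())  # [20.0, 22.0, 24.0, 27.0, 30.0]
```

Which option is appropriate depends on the data: dropping discards information, while filling and interpolating introduce estimated values.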


14. What are the benefits of using Plotly for data visualization?

- Using Plotly for data visualization offers several key benefits:

Interactivity: Plotly creates highly interactive charts with built-in features like zooming, panning, hovering tooltips, and clickable legends, making it easy to explore data dynamically.

Ease of Use: Plotly’s API is user-friendly and allows quick creation of complex visualizations with minimal code, which is especially helpful for dashboards and web apps.

Wide Range of Plot Types: It supports a broad variety of charts, including line, bar, scatter, pie, 3D plots, maps, and specialized statistical plots, covering many use cases.

Web Integration: Plotly outputs charts as HTML and JavaScript, making it seamless to embed interactive visualizations into websites, Jupyter notebooks, or dashboards without extra plugins.

Cross-platform Compatibility: Plotly works well across different platforms and environments, including Python, R, JavaScript, and more, enabling consistent visualization workflows.

Customization: It offers extensive options to customize nearly every aspect of a chart’s appearance and behavior to match branding or presentation needs.

15. How does NumPy handle multidimensional arrays?

- NumPy handles multidimensional arrays by providing the ndarray object, which can represent arrays of any number of dimensions—1D, 2D, 3D, and beyond. Each dimension is called an axis, and the shape of the array is a tuple that specifies the size along each axis.
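
A minimal sketch of shape and axes on a 3D array (the sizes are arbitrary):

```python
import numpy as np

# 24 values arranged as 2 blocks of 3 rows and 4 columns
cube = np.arange(24).reshape(2, 3, 4)

print(cube.ndim)    # 3
print(cube.shape)   # (2, 3, 4)

# Reducing along an axis removes that axis from the shape
sum_axis0 = cube.sum(axis=0)   # shape (3, 4)
sum_axis2 = cube.sum(axis=2)   # shape (2, 3)

# One index per axis selects a single element
last = cube[1, 2, 3]
print(last)         # 23
```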

16. What is the role of Bokeh in data visualization?

- Bokeh is a powerful Python library designed for creating interactive and visually appealing data visualizations for modern web browsers. Its main role is to enable users to build rich, customizable plots and dashboards that can be easily embedded into web applications or shared as standalone HTML files.

- Unlike static plotting libraries, Bokeh provides interactive features such as zooming, panning, hovering tooltips, and linked brushing, allowing users to explore data dynamically. It is especially useful for building complex dashboards and handling large or streaming datasets with smooth performance.

- Bokeh’s design emphasizes flexibility and integration, supporting embedding plots in Jupyter notebooks, standalone HTML documents, or server-backed web apps using Bokeh Server. This makes it a popular choice for data scientists and developers who want to create interactive visualizations that go beyond simple charts, combining ease of use with powerful customization and interactivity options.

17. Explain the difference between apply() and map() in Pandas

In Pandas, both apply() and map() are used to apply functions to data, but they differ in their scope, flexibility, and typical use cases:

map() is primarily used with Series to map values element-wise. It is often used to substitute or transform values based on a dictionary, Series, or a function. map() works well for simple value mappings or replacements.

apply() is more versatile and can be used with both Series and DataFrames. It allows you to apply a function along an axis (rows or columns in DataFrame) or element-wise in Series. apply() can handle more complex functions, aggregations, or row/column-wise operations.
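
The contrast can be sketched on a small DataFrame. The column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "grade": ["a", "b", "a"],
    "math": [90, 70, 80],
    "art": [60, 85, 75],
})

# map(): element-wise on a Series, here with a dictionary lookup
grade_label = df["grade"].map({"a": "Excellent", "b": "Good"})

# map() also accepts a function
math_scaled = df["math"].map(lambda v: v / 100)

# apply() with axis=1: the function receives one whole row at a time
total = df[["math", "art"]].apply(lambda row: row["math"] + row["art"], axis=1)

# apply() with axis=0: the function receives one whole column at a time
col_range = df[["math", "art"]].apply(lambda col: col.max() - col.min(), axis=0)
```

For simple arithmetic such as the row total, a vectorized expression like `df["math"] + df["art"]` is usually faster than apply(); apply() is used here only to show the row-wise mechanics.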

18. What are some advanced features of NumPy?


Some advanced features of NumPy that go beyond basic array operations include:

Broadcasting: Allows arithmetic operations on arrays of different shapes by automatically expanding the smaller array without making copies, enabling efficient vectorized computations.

Fancy Indexing and Boolean Indexing: Lets you select elements using arrays of indices or boolean masks, enabling powerful and flexible data filtering and modification.

Structured Arrays and Record Arrays: Support for heterogeneous data types within a single array, similar to a table with named columns, useful for handling complex datasets.

Universal Functions (ufuncs): Vectorized functions that operate element-wise on arrays with high performance and support for custom ufuncs.

Memory Mapping: Allows working with large datasets stored on disk as if they were in memory, useful for handling big data without loading everything at once.

Linear Algebra Module (numpy.linalg): Provides advanced matrix operations like solving linear systems, eigenvalue decomposition, singular value decomposition, and matrix inverses.

Random Number Generation: A flexible and extensible random number generator system with support for many probability distributions.

FFT (Fast Fourier Transform): Efficient computation of discrete Fourier transforms and inverse transforms for signal processing tasks.

Masked Arrays: Support for arrays with missing or invalid entries, allowing computations that ignore masked data.

Integration with C/C++ and Fortran: NumPy arrays can be efficiently shared with code written in lower-level languages, facilitating high-performance computing.
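
A few of the features listed above can be sketched together. The values and the small linear system are chosen only for illustration:

```python
import numpy as np

data = np.array([3, -1, 7, 0, -5, 9])

# Boolean indexing: keep only the positive entries
positives = data[data > 0]

# Fancy indexing: pick elements by an array of positions
picked = data[[0, 2, 5]]

# Masked array: ignore negative entries when averaging
masked = np.ma.masked_less(data, 0)
masked_mean = masked.mean()   # mean of 3, 7, 0, 9

# numpy.linalg: solve 2x + y = 5 and x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])
solution = np.linalg.solve(A, b)

# Random number generation with the Generator API (seed chosen arbitrarily)
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=5)
```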


19. How does Pandas simplify time series analysis?

- Pandas simplifies time series analysis by providing powerful, easy-to-use tools tailored for working with date and time data. It offers specialized data structures like DatetimeIndex, Timestamp, and Timedelta that make handling time-related data intuitive. With Pandas, you can effortlessly parse dates, resample data at different frequencies (e.g., daily to monthly), and perform time-based indexing and slicing.

- Additionally, Pandas supports rolling windows, shifts, and time zone conversions, which are essential for analyzing trends, seasonality, and lag effects in time series. Built-in functions for date offsets, periods, and frequency conversion allow easy manipulation and alignment of time series data. Overall, Pandas streamlines many complex operations required for time series analysis, enabling faster and more accurate insights without needing extensive manual handling of date-time details.
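
Several of these tools can be sketched on a short daily series. The dates and values are made up for illustration:

```python
import pandas as pd

# Six made-up daily observations
idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([10, 12, 11, 15, 14, 18], index=idx)

# Time-based slicing with date strings (both endpoints included)
first_three = ts.loc["2024-01-01":"2024-01-03"]

# Resample daily values into 2-day sums
two_day = ts.resample("2D").sum()

# Lag the series by one step to compute day-over-day change
change = ts - ts.shift(1)

# 3-day rolling mean
rolling = ts.rolling(window=3).mean()

print(two_day.tolist())  # [22, 26, 32]
```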

20. What is the role of a pivot table in Pandas?


- A pivot table in Pandas plays the role of summarizing and reorganizing data to provide insightful, aggregated views. It allows you to reshape large datasets by grouping data based on one or more keys (like categories or time periods), and then computing aggregate statistics such as sum, mean, count, or other functions for each group.

- Pivot tables help transform raw data into a more readable and meaningful format, making it easier to analyze patterns, trends, and relationships. They are especially useful for quick exploratory data analysis and reporting, enabling users to compare subsets of data across multiple dimensions (rows and columns) in a compact table.
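
A minimal sketch using pivot_table(), with hypothetical Region, Product, and Revenue columns:

```python
import pandas as pd

# Hypothetical sales records for illustration
sales = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "North"],
    "Product": ["A", "B", "A", "B", "A"],
    "Revenue": [100, 150, 200, 50, 300],
})

# Rows: Region, columns: Product, cells: summed Revenue
table = sales.pivot_table(
    index="Region", columns="Product", values="Revenue", aggfunc="sum"
)
print(table)
```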



21. Why is NumPy’s array slicing faster than Python’s list slicing?
- NumPy slicing is faster because it operates on fixed-type, contiguous memory blocks and creates views instead of copies: a basic slice is a new array object that points at the same data buffer, so no elements are copied. Slicing a Python list builds a new list and copies a reference to every selected element, so its cost grows with the length of the slice.
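
The view-versus-copy behavior can be checked directly with a short sketch:

```python
import numpy as np

arr = np.arange(10)
lst = list(range(10))

arr_slice = arr[2:5]   # view onto arr's memory
lst_slice = lst[2:5]   # new, independent list

# Writing through the view changes the original array
arr_slice[0] = 99
# Writing to the list slice leaves the original list untouched
lst_slice[0] = 99

print(arr[2])                            # 99
print(lst[2])                            # 2
print(np.shares_memory(arr, arr_slice))  # True
```

Because a view shares memory with its source, call `.copy()` on the slice when an independent array is needed.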


22. What are some common use cases for Seaborn?
- Some common use cases for Seaborn include:

    - Exploratory Data Analysis (EDA): quickly visualizing distributions, relationships, and patterns in data with plots like histograms, box plots, and scatter plots.

    - Statistical visualization: creating plots that show statistical relationships such as regression lines, confidence intervals, and categorical comparisons using tools like lmplot(), boxplot(), and violinplot().

    - Visualizing categorical data: easily comparing categories with count plots, bar plots, and swarm plots to understand group differences and distributions.

    - Correlation and heatmaps: displaying correlation matrices or other pairwise relationships through heatmaps, helping identify strong or weak associations between variables.

    - Multivariate analysis: using pair plots and joint plots to visualize interactions among multiple variables in one comprehensive view.

    - Enhanced aesthetics and themes: generating publication-quality graphics with appealing default styles, color palettes, and themes that improve readability and presentation.


# Practical questions

In [None]:
#1. How do you create a 2D NumPy array and calculate the sum of each row?

import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

row_sums = arr.sum(axis=1)

print("Sum of each row:", row_sums)


In [None]:
#2. Write a Pandas script to find the mean of a specific column in a DataFrame


import pandas as pd

# Example DataFrame
data = {
    'A': [10, 20, 30, 40],
    'B': [5, 15, 25, 35]
}
df = pd.DataFrame(data)

# Calculate the mean of column 'A'
mean_value = df['A'].mean()

print("Mean of column A:", mean_value)


In [None]:
#3. Create a scatter plot using Matplotlib

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [5, 7, 4, 6, 8]

# Create scatter plot
plt.scatter(x, y, color='blue', marker='o')

# Add title and labels
plt.title('Sample Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Show the plot
plt.show()


In [None]:
#4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 4, 5, 6],
    'D': [5, 6, 7, 8, 9]
}
df = pd.DataFrame(data)

# Calculate the correlation matrix with pandas (Seaborn only draws the heatmap)
corr = df.corr()

# Create heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm')

# Show plot
plt.title('Correlation Matrix Heatmap')
plt.show()


In [None]:
#5. Generate a bar plot using Plotly

import plotly.graph_objects as go

# Sample data
categories = ['Apples', 'Bananas', 'Cherries', 'Dates']
values = [10, 15, 7, 12]

# Create bar plot
fig = go.Figure(data=[go.Bar(x=categories, y=values)])

# Add title and axis labels
fig.update_layout(
    title='Fruit Count',
    xaxis_title='Fruit',
    yaxis_title='Count'
)

# Show the plot
fig.show()


In [None]:
#6. Create a DataFrame and add a new column based on an existing column
import pandas as pd

# Create a sample DataFrame
data = {
    'Price': [100, 200, 300, 400]
}
df = pd.DataFrame(data)

# Add a new column 'Discounted_Price' which is 10% less than 'Price'
df['Discounted_Price'] = df['Price'] * 0.9

print(df)



In [None]:
#7. Write a program to perform element-wise multiplication of two NumPy arrays

import numpy as np

# Define two NumPy arrays
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([5, 6, 7, 8])

# Element-wise multiplication
result = arr1 * arr2

print("Result of element-wise multiplication:", result)


In [None]:
#8. Create a line plot with multiple lines using Matplotlib

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y1 = [2, 3, 5, 7, 11]
y2 = [1, 4, 6, 8, 10]

# Plot multiple lines
plt.plot(x, y1, label='Line 1', color='blue', marker='o')
plt.plot(x, y2, label='Line 2', color='red', marker='x')

# Add title and labels
plt.title('Multiple Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Show legend
plt.legend()

# Display the plot
plt.show()


In [None]:
#9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Score': [85, 90, 75, 88]
}
df = pd.DataFrame(data)

# Define threshold
threshold = 30

# Filter rows where 'Age' is greater than threshold
filtered_df = df[df['Age'] > threshold]

print(filtered_df)


In [None]:
#10. Create a histogram using Seaborn to visualize a distribution

import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = [12, 15, 14, 16, 18, 20, 21, 19, 22, 23, 20, 18, 17, 15, 14]

# Create histogram
sns.histplot(data, bins=5, kde=False, color='skyblue')

# Add title and labels
plt.title('Histogram of Sample Data')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Show plot
plt.show()


In [None]:
#11. Perform matrix multiplication using NumPy

import numpy as np

# Define two matrices
A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

# Perform matrix multiplication
result = np.matmul(A, B)
# Alternatively, you can use: result = A @ B

print("Result of matrix multiplication:\n", result)



In [None]:
#12. Use Pandas to load a CSV file and display its first 5 rows

import pandas as pd

# Load CSV file into a DataFrame ('your_file.csv' is a placeholder path)
df = pd.read_csv('your_file.csv')

# Display the first 5 rows
print(df.head())


In [None]:
#13. Create a 3D scatter plot using Plotly

import plotly.graph_objects as go

# Sample data
x = [1, 2, 3, 4, 5]
y = [5, 6, 7, 8, 9]
z = [9, 8, 7, 6, 5]

# Create 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=x,
    y=y,
    z=z,
    mode='markers',
    marker=dict(
        size=8,
        color=z,               # Color by z value
        colorscale='Viridis',  # Colorscale
        opacity=0.8
    )
)])

# Set plot title and axis labels
fig.update_layout(
    title='3D Scatter Plot',
    scene=dict(
        xaxis_title='X Axis',
        yaxis_title='Y Axis',
        zaxis_title='Z Axis'
    )
)

fig.show()
