# Data Toolkit

**1.** What is NumPy, and why is it widely used in Python?
**ans-** NumPy is a powerful library in Python for numerical computing. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on them. Its efficient handling of arrays makes it ideal for data manipulation and scientific computing. NumPy is widely used for tasks like data analysis, machine learning, and scientific simulations.

**2.** How does broadcasting work in NumPy?
**ans-** Broadcasting lets NumPy perform element-wise operations on arrays of different shapes without explicitly replicating data. Shapes are compared from the trailing dimension backwards, and two dimensions are compatible when they are equal or when one of them is 1. A size-1 (or missing) dimension is virtually "stretched" to match the larger array, so operations like addition, subtraction, or multiplication run efficiently without copying the smaller array.
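
As a minimal sketch of broadcasting (the arrays and variable names below are made up for illustration), a 1-D row and a single-column array can each be combined with a 2-D matrix:

```python
import numpy as np

# Illustrative values only.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])       # shape (2, 3)
row = np.array([10, 20, 30])         # shape (3,)
col = np.array([[100], [200]])       # shape (2, 1)

# row is treated as shape (1, 3) and virtually repeated for each row of matrix.
added_row = matrix + row

# col has shape (2, 1) and is virtually repeated across the 3 columns.
added_col = matrix + col

print(added_row)
print(added_col)
```

If the trailing dimensions are neither equal nor 1 (for example, shapes (2, 3) and (2,)), NumPy raises a ValueError instead of guessing.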

**3.** What is a Pandas DataFrame?
**ans-** A Pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure in Python, similar to a table or a spreadsheet. It consists of rows and columns, where each column can contain different types of data (e.g., integers, strings, floats). DataFrames allow for easy manipulation, cleaning, filtering, and analysis of data.

**4.** Explain the use of the groupby() method in Pandas.
**ans-** The groupby() method in Pandas is used to group data based on certain criteria, enabling you to perform operations (such as aggregation, transformation, or filtering) on each group independently. It splits the data into groups, applies a function to each group, and then combines the results.
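
The split-apply-combine pattern described above could look like this on a small, hypothetical sales table (column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sales records, for illustration only.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "sales": [100, 150, 200, 50, 300],
})

# Split by region, apply sum to each group, combine into one Series.
total_by_region = df.groupby("region")["sales"].sum()

# Several aggregations at once, one output column per function.
summary = df.groupby("region")["sales"].agg(["mean", "max"])

print(total_by_region)
print(summary)
```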

**5.** Why is Seaborn preferred for statistical visualizations?
**ans-** Seaborn is preferred for statistical visualizations due to its simplicity and built-in support for complex plots like heatmaps, violin plots, and regression plots. It integrates well with Pandas, allowing direct plotting from DataFrames. Seaborn automatically handles aesthetics like colors and styles, producing visually appealing plots. It simplifies statistical analysis by providing easy-to-use functions for visualizing distributions and relationships. Overall, Seaborn is ideal for exploring data and gaining insights with minimal effort.

**6.** What are the differences between NumPy arrays and Python lists?
**ans-** NumPy arrays are fixed-size, homogeneous data structures, meaning all elements share one data type, while Python lists are dynamic and can contain elements of different types. NumPy arrays support element-wise arithmetic and are optimized for performance; Python lists have no element-wise arithmetic (multiplying a list by 2 repeats it rather than doubling its elements). NumPy provides a wide range of mathematical and statistical functions, whereas Python lists require loops or comprehensions for similar tasks. NumPy arrays are also more memory-efficient and faster for large datasets than Python lists.
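
A short sketch of two of these differences, using throwaway example values: the same operator behaves differently on a list and an array, and an array coerces its elements to a single dtype.

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

# On a list, * repeats the sequence; on an array, it multiplies each element.
list_times_two = py_list * 2
array_times_two = np_array * 2

# A list can hold mixed types; an array upcasts everything to one dtype.
mixed_list = [1, "two", 3.0]
coerced = np.array([1, 2.5, 3])    # the integers are upcast to float64

print(list_times_two)     # [1, 2, 3, 1, 2, 3]
print(array_times_two)    # [2 4 6]
print(coerced.dtype)      # float64
```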


**7.** What is a heatmap, and when should it be used?
**ans-** A heatmap is a data visualization that uses color gradients to represent values in a matrix or 2D data. It is useful for visualizing correlations, patterns, and the intensity of relationships between variables, often used for showing correlation matrices, frequency distributions, or spatial data. Heatmaps are ideal when you need to convey large amounts of data in a compact form and quickly identify areas of high or low values.

**8.** What does the term “vectorized operation” mean in NumPy?
**ans-** In NumPy, a "vectorized operation" refers to performing element-wise operations on entire arrays without the need for explicit Python loops. The loop still happens, but it runs in compiled C code, which makes it much faster than iterating in Python. Vectorization allows for concise and efficient computations, and the compiled loops can take advantage of low-level optimizations such as SIMD instructions on large datasets. Examples include adding, multiplying, or performing other arithmetic operations on arrays directly.
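
To make the contrast concrete, here is a small sketch (the formula x² + 1 is an arbitrary choice) that computes the same result with an explicit loop and with a vectorized expression:

```python
import numpy as np

values = np.arange(1, 6)    # array([1, 2, 3, 4, 5])

# Explicit Python loop: one interpreter-level iteration per element.
squared_loop = np.empty_like(values)
for i in range(len(values)):
    squared_loop[i] = values[i] ** 2 + 1

# Vectorized: the same arithmetic expressed on the whole array at once.
squared_vec = values ** 2 + 1

print(squared_vec)                                 # [ 2  5 10 17 26]
print(np.array_equal(squared_loop, squared_vec))   # True
```

On five elements the speed difference is negligible; it becomes noticeable as the array grows.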


**9.** How does Matplotlib differ from Plotly?
**ans-** Matplotlib and Plotly are both popular visualization libraries, but they have key differences:

Interactivity: Plotly is interactive by default, allowing users to zoom, pan, hover, and click on data points. Matplotlib produces static images by default, although its interactive backends (for example, in a desktop window or in Jupyter) support zooming and panning, and its built-in matplotlib.widgets module adds controls such as sliders and buttons.

Ease of Use: Plotly tends to be easier for creating interactive, web-ready visualizations with minimal code. Matplotlib provides more fine-grained control over plot customization but often requires more code for complex tasks.

Plot Types: Plotly offers a wider range of interactive and 3D plot types out of the box, while Matplotlib is primarily focused on 2D plots (though it supports 3D plotting via mplot3d).

**10.**  What is the significance of hierarchical indexing in Pandas?
**ans-** Hierarchical indexing in Pandas allows multiple levels of row or column labels, enabling complex data structures with nested groups. It facilitates efficient data selection, slicing, and access by multiple criteria. This indexing method is useful for grouping and performing operations on data across different levels. It also supports reshaping operations like stack() and unstack(). Overall, it makes handling multi-dimensional data more intuitive and flexible.
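
A minimal sketch of hierarchical indexing, assuming a made-up sales Series indexed by year and quarter:

```python
import pandas as pd

# Hypothetical data: sales indexed by (year, quarter).
index = pd.MultiIndex.from_tuples(
    [(2023, "Q1"), (2023, "Q2"), (2024, "Q1"), (2024, "Q2")],
    names=["year", "quarter"],
)
sales = pd.Series([10, 20, 30, 40], index=index, name="sales")

# Select every row under the outer label 2023.
sales_2023 = sales.loc[2023]

# Select a single value using both levels.
q2_2024 = sales.loc[(2024, "Q2")]

# Move the inner level into columns, giving one row per year.
wide = sales.unstack("quarter")

print(sales_2023)
print(q2_2024)    # 40
print(wide)
```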

**11.**  What is the role of Seaborn’s pairplot() function?
**ans-** Seaborn's pairplot() function creates a grid of scatter plots to visualize relationships between multiple variables in a dataset. It plots pairwise relationships for all numeric columns in a DataFrame, making it easier to detect correlations, trends, and patterns. Diagonal elements often show univariate distributions, such as histograms or kernel density estimates (KDEs). The pairplot() function is useful for exploratory data analysis (EDA) to quickly spot interactions and distributions among variables.


**12.**  What is the purpose of the describe() function in Pandas?
**ans-** The describe() function in Pandas provides a quick summary of the statistical properties of numeric columns in a DataFrame or Series. It computes count, mean, standard deviation, minimum, maximum, and percentiles (25%, 50%, and 75%) for each numeric column. This function is useful for understanding the distribution and central tendencies of your data during exploratory data analysis (EDA). It can also summarize non-numeric columns: with include='object', it reports the count, the number of unique values, the most frequent value (top), and how often that value occurs (freq).
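
As a small illustration (the DataFrame below is invented for the example), describe() can be called on a whole DataFrame for numeric columns and on a single text column for its categorical summary:

```python
import pandas as pd

# Invented sample data.
df = pd.DataFrame({
    "age": [24, 27, 22, 32, 29],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Pune"],
})

# Numeric summary: count, mean, std, min, quartiles, max.
numeric_summary = df.describe()

# Calling describe() on a text column gives count, unique, top, freq.
city_summary = df["city"].describe()

print(numeric_summary)
print(city_summary)
```

Here the mean age is 26.8, and the city summary reports "Pune" as the most frequent value with a frequency of 3.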

**13.** Why is handling missing data important in Pandas?
**ans-** Handling missing data in Pandas is important because missing or NaN (Not a Number) values can affect data analysis, leading to incorrect results or errors in calculations. If not addressed, they can distort statistical analyses, such as means, sums, and regressions, or cause issues during operations like merging or grouping. Most Pandas reductions skip NaN by default, which avoids errors but can silently bias results that are computed from fewer observations than expected. Proper handling (such as filling missing values, dropping rows or columns, or imputing values) preserves data integrity and improves the quality and accuracy of analysis.
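
The common options could be sketched as follows, on a tiny made-up table with one missing value per column:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame({
    "score": [85.0, np.nan, 78.0, 92.0],
    "age": [24, 27, np.nan, 32],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# pandas skips NaN by default; a plain NumPy mean propagates it.
pandas_mean = df["score"].mean()
numpy_mean = df["score"].to_numpy().mean()

# Option 1: drop any row that contains a missing value.
dropped = df.dropna()

# Option 2: fill each gap with that column's mean.
filled = df.fillna(df.mean())

print(missing_per_column)
print(pandas_mean, numpy_mean)
print(dropped)
print(filled)
```

Which option is appropriate depends on the dataset; mean imputation is only one simple choice among several.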

**14.** What are the benefits of using Plotly for data visualization?
**ans-** Plotly offers several benefits for data visualization:

Interactivity: Plotly plots are interactive by default, allowing users to zoom, pan, hover for tooltips, and click to explore data points, making the visualizations more engaging and insightful.

Wide Range of Plots: It supports a variety of chart types, including 2D, 3D plots, maps, and statistical charts, providing flexibility in visualizing different kinds of data.

Web Integration: Plotly is designed for web-based visualizations, making it easy to embed interactive plots in websites, dashboards, or applications using platforms like Dash.

Customization: Plotly allows for extensive customization of plots, such as colors, themes, and annotations, helping create visually appealing and tailored visualizations.

**15.** How does NumPy handle multidimensional arrays?
**ans-** NumPy handles multidimensional arrays using the ndarray object, allowing arrays of any number of dimensions. It stores data efficiently and supports operations across multiple dimensions with indexing, slicing, and broadcasting. Operations on these arrays are fast due to NumPy’s optimized C backend. You can create multidimensional arrays with functions like np.array(), np.zeros(), and np.ones(). NumPy makes it easy to manipulate and perform element-wise calculations on large datasets.
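
A brief sketch of working with a 3-D ndarray; the shape (2, 3, 4) is an arbitrary choice for illustration:

```python
import numpy as np

# 24 consecutive integers arranged as 2 blocks of 3 rows and 4 columns.
cube = np.arange(24).reshape(2, 3, 4)

print(cube.ndim)     # 3
print(cube.shape)    # (2, 3, 4)

# Index a single element with one index per dimension.
element = cube[1, 2, 3]

# Slice across dimensions: row 0 of every block, shape (2, 4).
first_rows = cube[:, 0, :]

# Reduce along one axis: sum over the last axis, shape (2, 3).
row_sums = cube.sum(axis=2)

print(element)       # 23
print(first_rows)
print(row_sums)
```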

**16.** What is the role of Bokeh in data visualization?
**ans-** Bokeh is a Python library for creating interactive, web-based visualizations. It allows for features like zooming, panning, and tooltips. Bokeh integrates well with web frameworks, enabling seamless embedding of plots in applications. It supports various plot types and real-time data streaming. The library offers extensive customization options for creating highly interactive and dynamic visualizations.

**17.**  Explain the difference between apply() and map() in Pandas.
**ans-** In Pandas, apply() and map() are both used for applying functions to data, but they differ in their usage:

apply(): It is more versatile and can be used on both Series and DataFrames. It allows applying a function along a specified axis (rows or columns in DataFrames), making it useful for more complex operations.

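map(): It is primarily a Series method for element-wise transformations. It accepts a dictionary, a function, or another Series to map values, making it a simple choice for straightforward replacements or lookups. Recent Pandas versions (2.1 and later) also provide DataFrame.map() for element-wise operations on a whole DataFrame, replacing applymap().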
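
The difference could be illustrated like this, with an invented grades table (column names and the grade-to-points mapping are assumptions):

```python
import pandas as pd

# Invented sample data.
df = pd.DataFrame({
    "grade": ["A", "B", "A", "C"],
    "math": [90, 75, 88, 60],
    "physics": [85, 80, 92, 55],
})

# Series.map with a dict: element-wise lookup.
points = df["grade"].map({"A": 4, "B": 3, "C": 2})

# Series.map with a function: applied to each element.
math_scaled = df["math"].map(lambda score: score / 100)

# DataFrame.apply with axis=1: the function receives one row at a time.
total = df[["math", "physics"]].apply(
    lambda row: row["math"] + row["physics"], axis=1
)

# DataFrame.apply with the default axis=0: one column at a time.
col_range = df[["math", "physics"]].apply(lambda col: col.max() - col.min())

print(points.tolist())     # [4, 3, 4, 2]
print(total.tolist())      # [175, 155, 180, 115]
print(col_range)
```

For simple arithmetic like the row total, the vectorized form `df["math"] + df["physics"]` is usually preferable to apply(); the lambda is used here only to show how apply() passes rows.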

**18.**  What are some advanced features of NumPy?
**ans-** Some advanced features of NumPy include:

Broadcasting: Allows operations on arrays of different shapes by automatically aligning their dimensions, enabling efficient element-wise operations without explicit replication.

Vectorization: Enables performing operations on entire arrays at once, significantly improving performance by eliminating the need for loops.

Advanced Indexing: Supports fancy indexing and boolean indexing, allowing complex selections and modifications of array elements based on conditions or specific indices.

Linear Algebra: NumPy includes efficient implementations of linear algebra functions like matrix multiplication, eigenvalues, and singular value decomposition (SVD).

Random Module: Offers a suite of functions for generating random numbers, random sampling, and creating random distributions, useful for simulations and testing.

Masked Arrays: Provides a way to handle arrays with missing or invalid data, allowing for more flexible computations with incomplete datasets.
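
Several of the features listed above can be sketched in a few lines. The values, the 2x2 system, and the seed below are arbitrary choices for illustration:

```python
import numpy as np

data = np.array([5, -1, 8, -3, 2])

# Boolean indexing: keep elements that satisfy a condition.
positives = data[data > 0]

# Fancy indexing: pick elements by a list of positions.
picked = data[[0, 2, 4]]

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

# Random module: a seeded generator gives reproducible samples.
rng = np.random.default_rng(seed=0)
sample = rng.normal(size=3)

# Masked array: treat negative readings as invalid and ignore them.
masked = np.ma.masked_less(data, 0)
masked_mean = masked.mean()    # mean of 5, 8 and 2

print(positives)      # [5 8 2]
print(picked)         # [5 8 2]
print(x)
print(sample)
print(masked_mean)    # 5.0
```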

**19.**  How does Pandas simplify time series analysis?
**ans-** Pandas simplifies time series analysis through several key features:

DateTime Indexing: It allows you to easily create and work with time-based indexes (e.g., daily, monthly, or hourly) for efficient data retrieval and manipulation.

Resampling: Pandas supports resampling with resample(), which changes the frequency of time series data, either downsampling with aggregation (e.g., converting daily data to monthly totals) or upsampling to a higher frequency.

Rolling and Expanding Windows: You can compute moving averages or other rolling statistics using rolling() or expanding() functions, which are essential for time series analysis.

Shifting and Lagging: It allows for easily shifting time series data forward or backward (e.g., to compute differences between time periods) using .shift().

Time Zone Handling: Pandas offers built-in support for time zone conversion and handling, making it easier to work with time series data across different time zones.

Datetime Functions: With Pandas' pd.to_datetime() and other datetime utilities, you can easily manipulate and extract specific components like day, month, or year.
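
A compact sketch of a few of these features on a hypothetical six-day series (the dates and values are invented):

```python
import pandas as pd

# Hypothetical daily readings.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([10, 12, 11, 15, 14, 18], index=dates)

# Resampling: downsample daily values into 2-day sums.
two_day = ts.resample("2D").sum()

# Rolling window: 3-day moving average (first two entries are NaN).
rolling_mean = ts.rolling(window=3).mean()

# Shifting: difference from the previous day.
daily_change = ts - ts.shift(1)

# DateTime indexing: label-based slicing includes both end points.
window = ts.loc["2024-01-02":"2024-01-04"]

print(two_day)
print(rolling_mean)
print(daily_change)
print(window)
```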

**20.** What is the role of a pivot table in Pandas?
**ans-** A pivot table in Pandas is used to summarize and aggregate data, reshaping it for easier analysis. It allows grouping data by one or more columns and applying aggregation functions (e.g., sum, mean). Pivot tables help transform long-format data into a more structured table for multi-dimensional analysis. They are useful for comparing different categories and summarizing large datasets efficiently.
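
For instance, a long-format table of hypothetical sales records could be pivoted into a region-by-product grid:

```python
import pandas as pd

# Hypothetical long-format records.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "product": ["A", "B", "A", "B", "A"],
    "sales": [100, 200, 150, 50, 300],
})

# Rows become regions, columns become products, cells hold summed sales.
pivot = pd.pivot_table(
    df, values="sales", index="region", columns="product", aggfunc="sum"
)

print(pivot)
```

The two North/A records (100 and 300) are combined into a single cell of 400.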

**21.** Why is NumPy’s array slicing faster than Python’s list slicing?
**ans-** NumPy's array slicing is faster than Python's list slicing mainly because basic slicing of a NumPy array returns a view: a new array object that points at the same underlying memory, so no elements are copied regardless of the slice's length. Slicing a Python list builds a new list and copies a reference to every selected element, so its cost grows with the size of the slice. NumPy arrays also store homogeneous data in a single typed block of memory (contiguous by default), whereas a list stores pointers to separately allocated Python objects, so operations on a slice run in compiled C code without per-element type checks.
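
One concrete difference worth seeing is that a basic NumPy slice shares memory with the original array, while a list slice is an independent copy. A minimal sketch with throwaway values:

```python
import numpy as np

arr = np.arange(10)
lst = list(range(10))

arr_slice = arr[2:5]    # view onto arr's memory, nothing is copied
lst_slice = lst[2:5]    # new list holding copied references

# Writing through the view changes the original array...
arr_slice[0] = 99
# ...while writing to the list slice leaves the original list untouched.
lst_slice[0] = 99

print(arr[2])                              # 99
print(lst[2])                              # 2
print(np.shares_memory(arr, arr_slice))    # True

# Use .copy() when an independent array is needed.
independent = arr[2:5].copy()
```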

**22.** What are some common use cases for Seaborn?
**ans-** Some common use cases for Seaborn include:

Exploratory Data Analysis (EDA): Seaborn is often used for quickly visualizing relationships, distributions, and trends in datasets, helping to identify patterns and outliers.

Statistical Plots: It is ideal for creating complex statistical visualizations, such as regression plots, violin plots, and pair plots, to show relationships between multiple variables.

Categorical Data Visualization: Seaborn makes it easy to visualize categorical data through plots like bar plots, box plots, and count plots, which reveal patterns across different categories.

Heatmaps: Seaborn is frequently used for creating heatmaps to visualize correlation matrices or data relationships in a compact, color-coded format.

Visualization of Relationships: It is used for visualizing relationships between numerical variables with plots like scatter plots, line plots, and joint plots.

# Practical

**1.**  How do you create a 2D NumPy array and calculate the sum of each row?

import numpy as np

# Create a 2D array
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# axis=1 collapses the columns, giving one sum per row
row_sums = np.sum(array, axis=1)

print("2D Array:")
print(array)

print("\nSum of each row:")
print(row_sums)


**2.**  Write a Pandas script to find the mean of a specific column in a DataFrame.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [24, 27, 22, 32, 29],
        'Score': [85, 92, 78, 88, 95]}

df = pd.DataFrame(data)

mean_age = df['Age'].mean()

mean_score = df['Score'].mean()

print(f"Mean of Age: {mean_age}")
print(f"Mean of Score: {mean_score}")


**3.** Create a scatter plot using Matplotlib.

import matplotlib.pyplot as plt

# Sample data for scatter plot
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]

# Create the scatter plot
plt.scatter(x, y)

# Add labels and a title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Scatter Plot')

# Show the plot
plt.show()


**4.** How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?


import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [5, 4, 3, 2, 1],
        'C': [2, 3, 4, 5, 6],
        'D': [10, 9, 8, 7, 6]}

df = pd.DataFrame(data)

# pandas computes the correlation matrix; Seaborn only draws it
correlation_matrix = df.corr()

plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

# Add title
plt.title('Correlation Matrix Heatmap')

# Show the plot
plt.show()


**5.** Generate a bar plot using Plotly.

**6.**  Create a DataFrame and add a new column based on an existing column.

import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40]}

df = pd.DataFrame(data)

df['Age_plus_5'] = df['Age'] + 5

print(df)


**7.** Write a program to perform element-wise multiplication of two NumPy arrays.

import numpy as np

array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])

# Perform element-wise multiplication
result = array1 * array2

print("Result of element-wise multiplication:", result)


**8.** Create a line plot with multiple lines using Matplotlib.

import matplotlib.pyplot as plt

# Data for plotting
x = [0, 1, 2, 3, 4, 5]
y1 = [0, 1, 4, 9, 16, 25]
y2 = [0, 1, 2, 3, 4, 5]
y3 = [0, -1, -2, -3, -4, -5]

plt.plot(x, y1, label='y = x^2', color='blue', linestyle='-')
plt.plot(x, y2, label='y = x', color='green', linestyle='--')
plt.plot(x, y3, label='y = -x', color='red', linestyle=':')

plt.title('Multiple Line Plot Example')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

plt.legend()

plt.show()


**9.** Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}

df = pd.DataFrame(data)

threshold = 60000

filtered_df = df[df['Salary'] > threshold]

# Display the filtered DataFrame
print(filtered_df)


**10.** Create a histogram using Seaborn to visualize a distribution.

import seaborn as sns
import matplotlib.pyplot as plt

# Sample data: a small, hand-written list of values (not randomly generated)
data = [12, 15, 14, 10, 18, 19, 21, 22, 15, 14, 17, 16, 20, 19, 13, 18, 17, 21, 15, 14]

# kde=True overlays a kernel density estimate on the histogram
sns.histplot(data, bins=10, kde=True)

plt.title('Histogram with Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.show()


**11.** Perform matrix multiplication using NumPy.

import numpy as np

# Define two matrices
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# For 2-D arrays, the @ operator (np.matmul) gives the same result as np.dot
result = matrix1 @ matrix2

print("Result of matrix multiplication:")
print(result)


**12.** Use Pandas to load a CSV file and display its first 5 rows.

import pandas as pd

df = pd.read_csv('your_file.csv')

# Display the first 5 rows of the DataFrame
print(df.head())


**13.**  Create a 3D scatter plot using Plotly.

import plotly.express as px
import pandas as pd

data = {
    'x': [1, 2, 3, 4, 5],
    'y': [10, 11, 12, 13, 14],
    'z': [20, 21, 22, 23, 24],
}

df = pd.DataFrame(data)

fig = px.scatter_3d(df, x='x', y='y', z='z', title='3D Scatter Plot')


fig.show()
