#Data Toolkit
#Theory Questions

1. What is NumPy, and why is it widely used in Python?
-> NumPy (Numerical Python) is a fundamental library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. NumPy is widely used in Python for several reasons:

* Performance: NumPy is implemented in C, which allows for efficient computation and faster execution of operations compared to standard Python lists.
* Functionality: It offers a wide range of mathematical functions, including linear algebra, statistical operations, and Fourier transforms, making it suitable for scientific computing.
* Interoperability: NumPy arrays can be easily integrated with other libraries, such as Pandas, Matplotlib, and SciPy, enhancing its utility in data analysis and visualization.
* Ease of Use: The array-oriented programming model simplifies complex mathematical operations, making it easier for users to write concise and readable code.

2. How does broadcasting work in NumPy?
-> Broadcasting in NumPy refers to the ability to perform arithmetic operations on arrays of different shapes and sizes. When performing operations on arrays, NumPy automatically expands the smaller array to match the shape of the larger array, allowing for element-wise operations without the need for explicit replication of data.
The rules of broadcasting are as follows:

* If the arrays have different numbers of dimensions, the shape of the smaller array is padded with ones on the left side until both shapes are the same.
* If the sizes of the dimensions do not match, NumPy checks if one of the dimensions is 1. If so, it stretches that dimension to match the other array's size.
* If the sizes of the dimensions do not match and neither is 1, broadcasting fails, and an error is raised.
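For example, a 1D array of shape (4,) and a 2D array of shape (3, 4) are compatible: NumPy treats the 1D array as shape (1, 4) and stretches it across the three rows of the 2D array. A 1D array of shape (3,) would not broadcast against shape (3, 4), because the trailing dimensions (3 and 4) differ and neither is 1; it would first have to be reshaped to (3, 1).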
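
The broadcasting rules can be checked with a small sketch. The array names and values here are illustrative assumptions, not part of the original question:

```python
import numpy as np

matrix = np.arange(12).reshape(3, 4)    # shape (3, 4)
row = np.array([10, 20, 30, 40])        # shape (4,), treated as (1, 4)
col = np.array([[100], [200], [300]])   # shape (3, 1)

row_sum = matrix + row   # row is stretched down the 3 rows
col_sum = matrix + col   # col is stretched across the 4 columns

# Shapes (3, 4) and (3,) are incompatible: trailing sizes 4 and 3 differ
try:
    matrix + np.array([1, 2, 3])
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

print(row_sum.shape, col_sum.shape, broadcast_failed)
```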

3. What is a Pandas DataFrame?
-> A Pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a spreadsheet or SQL table and is one of the primary data structures in the Pandas library. DataFrames allow for easy manipulation, analysis, and visualization of data.
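
As a rough illustration, the sketch below builds a small DataFrame with mixed column types and row labels. The column names, labels, and values are made up for the example:

```python
import pandas as pd

# Each column can hold a different data type (heterogeneous)
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cy"],
        "age": [31, 25, 40],
        "score": [88.5, 92.0, 79.5],
    },
    index=["r1", "r2", "r3"],  # row labels
)

# Size-mutable: a new column can be added after creation
df["passed"] = df["score"] >= 80

print(df.dtypes)
print(df.loc["r2", "age"])  # label-based access by row and column
```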

4. Explain the use of the groupby() method in Pandas.
-> The groupby() method in Pandas is used to split the data into groups based on some criteria. It allows you to perform operations on these groups, such as aggregation, transformation, or filtering. For example, you can group data by a specific column and then calculate the mean, sum, or count for each group. This is particularly useful for summarizing data and performing exploratory data analysis.
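
A minimal sketch of the three kinds of group operation mentioned above (aggregation, transformation, filtering), using a hypothetical `sales` table:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "amount": [100, 200, 150, 50, 250],
})

# Aggregation: one value per group
totals = sales.groupby("region")["amount"].sum()
summary = sales.groupby("region")["amount"].agg(["mean", "count"])

# Transformation: a result aligned with the original rows
sales["share"] = sales["amount"] / sales.groupby("region")["amount"].transform("sum")

# Filtering: keep only the groups that satisfy a condition
big = sales.groupby("region").filter(lambda g: g["amount"].sum() > 300)

print(totals)
print(summary)
```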

5. Why is Seaborn preferred for statistical visualizations?
-> Seaborn is preferred for statistical visualizations because it provides a high-level interface for drawing attractive and informative statistical graphics. It is built on top of Matplotlib and integrates well with Pandas DataFrames. Seaborn simplifies the process of creating complex visualizations, such as heatmaps, violin plots, and pair plots, while also providing better default aesthetics and color palettes.

6. What are the differences between NumPy arrays and Python lists?
* Performance: NumPy arrays are more efficient for numerical operations and consume less memory than Python lists.
* Homogeneity: NumPy arrays are homogeneous, meaning all elements must be of the same data type, while Python lists can contain mixed data types.
* Functionality: NumPy provides a wide range of mathematical functions and element-wise operations on arrays, while Python lists have no built-in element-wise arithmetic (for example, `*` on a list repeats it instead of multiplying its elements).
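
The differences can be seen in a short sketch. The variable names are illustrative, and the dtype is set explicitly so the memory figures do not depend on the platform:

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3], dtype=np.int64)

list_times_two = py_list * 2     # repetition: [1, 2, 3, 1, 2, 3]
array_times_two = np_array * 2   # element-wise: [2, 4, 6]

# Lists may mix types; arrays convert everything to one common dtype
mixed_list = [1, "two", 3.0]
mixed_array = np.array([1, 2.5, 3])   # all values become float64

# Each int64 element occupies exactly 8 bytes in the array buffer
print(np_array.itemsize, np_array.nbytes)
```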

7. What is a heatmap, and when should it be used?
-> A heatmap is a data visualization technique that uses color to represent the values of a matrix or a two-dimensional dataset. It is particularly useful for visualizing correlations, patterns, and the density of data points. Heatmaps are commonly used in exploratory data analysis to identify relationships between variables or to visualize the distribution of data.

8. What does the term “vectorized operation” mean in NumPy?
-> Vectorized operations in NumPy refer to the ability to perform operations on entire arrays or large datasets without the need for explicit loops. This is achieved through the use of NumPy's underlying C and Fortran libraries, which allow for efficient computation. Vectorized operations lead to faster execution and more concise code, as they enable element-wise operations on arrays.
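
For comparison, here is a small sketch that computes the same squares with an explicit loop and with a vectorized expression. The array contents are arbitrary:

```python
import numpy as np

values = np.arange(1, 6, dtype=np.float64)   # [1, 2, 3, 4, 5]

# Explicit Python loop: one element at a time
squares_loop = np.empty_like(values)
for i in range(values.size):
    squares_loop[i] = values[i] ** 2

# Vectorized: a single expression applied to the whole array
squares_vec = values ** 2
total = squares_vec.sum()

print(squares_vec, total)
```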

9. How does Matplotlib differ from Plotly?
-> Matplotlib and Plotly differ mainly in the following ways:
* Interactivity: Plotly provides interactive visualizations that allow users to hover, zoom, and click on elements, while Matplotlib primarily produces static plots.
* Ease of Use: Plotly often needs less code for common interactive charts, while Matplotlib gives finer low-level control but can require more code for similar results.
* Output Formats: Plotly visualizations can be easily embedded in web applications, while Matplotlib is more suited for generating static images for reports and publications.

10. What is the significance of hierarchical indexing in Pandas?
-> Hierarchical indexing (or multi-indexing) in Pandas allows for multiple levels of indexing on a DataFrame or Series. This is significant because it enables more complex data structures and facilitates easier data manipulation and analysis. Hierarchical indexing is useful for working with higher-dimensional data in a two-dimensional format, allowing for more intuitive data slicing and aggregation.
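
A minimal sketch of a two-level index, assuming a hypothetical revenue Series keyed by year and quarter:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
revenue = pd.Series([10, 12, 15, 18], index=index, name="revenue")

rev_2024 = revenue.loc["2024"]                     # slice on the outer level
q1_all_years = revenue.xs("Q1", level="quarter")   # cross-section on the inner level
per_year = revenue.groupby(level="year").sum()     # aggregate by one level
wide = revenue.unstack("quarter")                  # move a level into the columns

print(wide)
```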

11. What is the role of Seaborn’s pairplot() function?
-> Seaborn’s pairplot() function is used to create a grid of scatter plots for visualizing the pairwise relationships between multiple variables in a dataset. It is particularly useful for exploratory data analysis, as it allows you to quickly assess the relationships and distributions of different features in a DataFrame.

Key features of pairplot() include:
* Scatter Plots: It generates scatter plots for each pair of variables, helping to visualize correlations.
* Diagonal Plots: The diagonal of the grid can display histograms or kernel density estimates (KDE) to show the distribution of each variable.
* Hue Parameter: You can use the hue parameter to color the points based on a categorical variable, which helps in understanding how different categories relate to the numerical variables.

Overall, pairplot() is a powerful tool for visualizing complex datasets and identifying patterns or relationships among variables.

12. What is the purpose of the describe() function in Pandas?
-> The describe() function in Pandas is used to generate descriptive statistics of a DataFrame or Series. It provides a quick overview of the central tendency, dispersion, and shape of the dataset's distribution.

Key outputs of the describe() function include:
* Count: The number of non-null entries.
* Mean: The average value of the numerical columns.
* Standard Deviation (std): A measure of the amount of variation or dispersion in the dataset.
* Minimum (min): The smallest value in the dataset.
* 25th, 50th (median), and 75th Percentiles: These values provide insights into the distribution of the data.
* Maximum (max): The largest value in the dataset.

The describe() function is particularly useful for quickly summarizing the characteristics of numerical data and identifying potential outliers or anomalies.
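
A short sketch of the output, using made-up data. By default only the numeric columns are summarized; calling describe() on a text column reports count, unique, top, and freq instead:

```python
import pandas as pd

df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "city": ["A", "B", "A", "C", "A"],
})

stats = df.describe()                # numeric columns only by default
city_stats = df["city"].describe()   # count, unique, top, freq

print(stats)
print(city_stats)
```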

13. Why is handling missing data important in Pandas?
-> Handling missing data is crucial in Pandas because missing values can lead to inaccurate analyses and results. If not addressed, they can skew statistical calculations, affect model performance, and lead to biased conclusions. Pandas provides various methods to identify, fill, or drop missing values, allowing for cleaner datasets and more reliable analyses.
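
The identify, drop, and fill steps might look like the sketch below. The column names and values are invented, and filling with the column mean is only one possible strategy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp": [20.0, np.nan, 24.0, 22.0],
    "humidity": [0.5, 0.6, np.nan, np.nan],
})

missing_per_column = df.isna().sum()   # identify: count NaNs per column
dropped = df.dropna()                  # drop: keep only fully populated rows
filled = df.fillna(df.mean())          # fill: replace NaNs with the column mean

print(missing_per_column)
print(filled)
```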


14. What are the benefits of using Plotly for data visualization?
-> Plotly is a powerful library for creating interactive visualizations. Some of its key benefits include:
* Interactivity: Plotly visualizations are interactive by default, allowing users to hover, zoom, and pan, which enhances data exploration.
* Web Integration: Plotly graphs can be easily embedded in web applications and dashboards, making it suitable for sharing insights online.
* Wide Range of Charts: It supports a variety of chart types, including 3D plots, contour plots, and geographic maps, catering to diverse visualization needs.
* Customization: Plotly offers extensive customization options for styling and layout, allowing users to create visually appealing graphics.
* Integration with Dash: Plotly can be integrated with Dash, a framework for building web applications, enabling the creation of interactive dashboards.


15. How does NumPy handle multidimensional arrays?
-> NumPy handles multidimensional arrays through its core data structure called ndarray. Key features include:

* N-Dimensional Support: NumPy can create arrays with any number of dimensions (1D, 2D, 3D, etc.), allowing for complex data representations.
* Efficient Storage: Multidimensional arrays are stored in contiguous blocks of memory, which improves performance for mathematical operations.
* Array Operations: NumPy supports element-wise operations and broadcasting, enabling efficient computations across multidimensional arrays without the need for explicit loops.
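
A brief sketch with a 3D array shows these features in practice. The shape (2, 3, 4) is an arbitrary choice for illustration:

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)   # 2 blocks, 3 rows, 4 columns
print(cube.ndim, cube.shape)

sum_over_blocks = cube.sum(axis=0)      # collapse the first axis -> shape (3, 4)
last_column = cube[:, :, -1]            # last column of every block -> shape (2, 3)
is_contiguous = cube.flags["C_CONTIGUOUS"]   # stored as one row-major block

print(sum_over_blocks)
print(last_column)
```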


16. What is the role of Bokeh in data visualization?
-> Bokeh is a Python library for creating interactive and visually appealing plots and dashboards. Its key roles include:
* Interactivity: Bokeh allows users to create interactive plots that can respond to user inputs, such as sliders and dropdowns.
* Web-Ready: Bokeh visualizations can be easily embedded in web applications, making it suitable for sharing insights online.
* Large Datasets: It is designed to handle large datasets efficiently, allowing for real-time streaming and updates.
* Customizable: Bokeh provides extensive customization options for creating complex visualizations tailored to specific needs.


17. Explain the difference between apply() and map() in Pandas.
* apply(): This function is used to apply a function along an axis of the DataFrame (rows or columns) or to a Series. It can take a function that operates on entire rows or columns, making it versatile for complex operations.
* map(): This function is primarily used with a Series to transform values element-wise. It accepts a function, a dict, or another Series as the mapping, and is simpler than apply() for one-to-one value substitution.
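
The contrast can be sketched on a small DataFrame. The column names and lambdas are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# map(): element-wise on a Series, with a dict or a function
labels = df["a"].map({1: "one", 2: "two", 3: "three"})
doubled = df["a"].map(lambda v: v * 2)

# apply(): the function receives a whole column (axis=0) or a whole row (axis=1)
col_ranges = df.apply(lambda col: col.max() - col.min())
row_totals = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(labels.tolist(), col_ranges.to_dict(), row_totals.tolist())
```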


18. What are some advanced features of NumPy?
-> Some advanced features of NumPy include:
* Broadcasting: Allows operations on arrays of different shapes without explicit replication.
* Fancy Indexing: Enables advanced indexing techniques, such as using arrays of indices to access multiple elements.
* Linear Algebra Functions: Provides a suite of functions for linear algebra operations, including matrix multiplication and eigenvalue decomposition.
* Random Number Generation: Includes a module for generating random numbers and performing random sampling.
* Fourier Transforms: Supports fast Fourier transforms for frequency analysis.
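
Several of these features fit in one short sketch. The data, the seed, and the 5-cycle test signal are assumptions chosen for illustration:

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])
picked = data[[0, 2, 4]]   # fancy indexing with a list of positions
large = data[data > 25]    # boolean mask indexing

a = np.array([[2.0, 0.0], [0.0, 3.0]])
product = a @ a                        # matrix multiplication
eigenvalues = np.linalg.eigvals(a)     # eigenvalues of a diagonal matrix

rng = np.random.default_rng(seed=0)    # seeded random number generator
samples = rng.normal(size=1000)

# A sine wave with exactly 5 cycles over 100 samples peaks at frequency bin 5
signal = np.sin(2 * np.pi * 5 * np.arange(100) / 100)
spectrum = np.abs(np.fft.rfft(signal))
dominant_bin = int(np.argmax(spectrum))

print(picked, large, dominant_bin)
```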

19. How does Pandas simplify time series analysis?
-> Pandas simplifies time series analysis through:

* Datetime Indexing: It allows for easy indexing and slicing of time series data using datetime objects.
* Resampling: Provides functionality to resample time series data to different frequencies (e.g., daily to monthly).
* Time Zone Handling: Supports time zone conversions and operations on time zone-aware datetime objects.
* Rolling Statistics: Offers methods for calculating rolling statistics, such as moving averages, which are essential for time series analysis.
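
A minimal sketch of these four features on a synthetic daily series. The dates, values, and the target time zone are arbitrary choices:

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=60, freq="D")
series = pd.Series(range(60), index=dates)

january = series.loc["2024-01"]               # datetime indexing with a partial date string
monthly_mean = series.resample("MS").mean()   # resample daily data to month-start bins
rolling_7d = series.rolling(window=7).mean()  # 7-day moving average

utc = series.tz_localize("UTC")               # attach a time zone
kolkata = utc.tz_convert("Asia/Kolkata")      # convert to another time zone

print(monthly_mean)
print(kolkata.index[0])
```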

20. What is the role of a pivot table in Pandas?
-> A pivot table in Pandas is used to summarize and aggregate data. It allows users to:


* Reshape Data: Transform long-format data into a wide format, making it easier to analyze.
* Aggregation: Perform aggregation operations (e.g., sum, mean) on specified columns based on unique values in other columns.
* Multi-dimensional Analysis: Create multi-dimensional summaries by specifying multiple index and column variables.
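
As a sketch, the long-format table below (hypothetical regions, products, and amounts) is reshaped into one row per region and one column per product, summing the amounts in each cell:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East"],
    "product": ["pen", "ink", "pen", "ink", "pen"],
    "amount": [5, 7, 3, 4, 6],
})

table = pd.pivot_table(
    sales,
    values="amount",     # column to aggregate
    index="region",      # becomes the row labels
    columns="product",   # becomes the column labels
    aggfunc="sum",
)

print(table)
```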

21. Why is NumPy’s array slicing faster than Python’s list slicing?
-> NumPy’s array slicing is generally faster than Python’s list slicing because:
* Views Instead of Copies: Basic slicing of a NumPy array returns a view onto the same memory, so no elements are copied. Slicing a list builds a new list and copies a reference to every selected element.
* Contiguous Memory: NumPy arrays store their elements in a single typed memory block, so a slice can be described by just an offset, a shape, and strides.
* Optimized C Implementation: NumPy's array operations are implemented in C, which allows for optimized performance compared to Python’s list operations.
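
The difference between a list slice (a copy) and a basic NumPy slice (a view) can be observed directly. This is a small sketch with arbitrary values:

```python
import numpy as np

numbers_list = list(range(10))
numbers_array = np.arange(10)

list_slice = numbers_list[2:5]     # new list: element references are copied
array_slice = numbers_array[2:5]   # view: shares memory with numbers_array

list_slice[0] = -1    # does not affect numbers_list
array_slice[0] = -1   # writes through to numbers_array

shares = np.shares_memory(numbers_array, array_slice)
print(numbers_list[2], numbers_array[2], shares)
```

Because the slice is a view, call `.copy()` on it when an independent array is needed.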

22. What are some common use cases for Seaborn?
-> Common use cases for Seaborn include:
* Statistical Data Visualization: Creating visualizations that represent statistical relationships, such as regression plots and distribution plots.
* Categorical Data Analysis: Visualizing relationships between categorical and numerical variables using box plots, violin plots, and bar plots.
* Heatmaps: Displaying correlation matrices and other matrix-like data in a visually appealing format.
* Pairwise Relationships: Using pairplot() to explore pairwise relationships in datasets with multiple variables.



#Practical Questions

In [None]:
#1. How do you create a 2D NumPy array and calculate the sum of each row?
import numpy as np

# Create a 2D NumPy array
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Calculate the sum of each row
row_sums = np.sum(array_2d, axis=1)
print(row_sums)

In [None]:
#2. Write a Pandas script to find the mean of a specific column in a DataFrame
import pandas as pd

# Sample data
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Find the mean of column 'B'
mean_B = df['B'].mean()
print(mean_B)

In [None]:
#3. Create a scatter plot using Matplotlib
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]

# Create scatter plot
plt.scatter(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot using Matplotlib')
plt.show()

In [None]:
#4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Sample data
data = {
    'A': np.random.rand(10),
    'B': np.random.rand(10),
    'C': np.random.rand(10),
    'D': np.random.rand(10)
}
df = pd.DataFrame(data)

# Calculate correlation matrix
correlation_matrix = df.corr()

# Visualize with heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix Heatmap')
plt.show()

In [None]:
#5. Generate a bar plot using Plotly
import plotly.graph_objects as go

# Sample data
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 7, 12]

# Create bar plot
fig = go.Figure(data=[go.Bar(x=categories, y=values)])
fig.update_layout(title='Bar Plot using Plotly', xaxis_title='Categories', yaxis_title='Values')
fig.show()

In [None]:
#6. Create a DataFrame and add a new column based on an existing column
import pandas as pd

# Sample data
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Add new column based on existing column
df['C'] = df['A'] * 2
print(df)

In [None]:
#7. Write a program to perform element-wise multiplication of two NumPy arrays
import numpy as np

# Sample data
array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([10, 20, 30, 40, 50])

# Element-wise multiplication
result = array1 * array2
print(result)

In [None]:
#8. Create a line plot with multiple lines using Matplotlib
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y1 = [2, 3, 5, 7, 11]
y2 = [1, 4, 6, 8, 10]

# Create line plot
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot with Multiple Lines using Matplotlib')
plt.legend()
plt.show()

In [None]:
#9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold
import pandas as pd

# Sample data
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Filter rows where column 'B' value is greater than 25
filtered_df = df[df['B'] > 25]
print(filtered_df)

In [None]:
#10. Create a histogram using Seaborn to visualize a distribution
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Sample data
data = np.random.randn(100)

# Create histogram
sns.histplot(data, kde=True)
plt.title('Histogram using Seaborn')
plt.show()

In [None]:
#11. Perform matrix multiplication using NumPy
import numpy as np

# Sample data
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# Matrix multiplication (for 2D arrays, np.dot is equivalent to matrix1 @ matrix2)
result = np.dot(matrix1, matrix2)
print(result)

In [None]:
#12. Use Pandas to load a CSV file and display its first 5 rows
import pandas as pd

# Load CSV file (replace 'sample.csv' with the path to your CSV file)
df = pd.read_csv('sample.csv')

# Display first 5 rows
print(df.head())

In [None]:
#13. Create a 3D scatter plot using Plotly
import plotly.graph_objects as go

# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
z = [5, 6, 2, 8, 3]

# Create 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(x=x, y=y, z=z, mode='markers')])
fig.update_layout(title='3D Scatter Plot using Plotly', scene=dict(xaxis_title='X-axis', yaxis_title='Y-axis', zaxis_title='Z-axis'))
fig.show()