# **DATA TOOLKIT**

# QUESTION 1 --  What is NumPy, and why is it widely used in Python?


NumPy (**Num**erical **Py**thon) is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.



NumPy is widely used for several key reasons:

1.  **Performance:** NumPy arrays are more efficient than standard Python lists for storing and manipulating large datasets. They hold elements of a single type in a compact block of memory, and their operations run in compiled C code rather than in the Python interpreter.
2.  **Vectorization:** NumPy allows you to perform operations on entire arrays at once without explicit loops. This vectorization leads to significantly faster and more concise code, especially for mathematical and scientific computations.
3.  **Broadcasting:** NumPy's broadcasting feature allows operations between arrays of different shapes and sizes under certain conditions, making it easier to perform element-wise operations.
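
As a minimal sketch of these points, the snippet below converts a small array of temperatures in one expression. The variable names and sample values are illustrative assumptions, not part of the original answer.

```python
import numpy as np

# Hypothetical sample data: temperatures in degrees Celsius.
celsius = np.array([21.5, 23.0, 19.8, 25.1])

# One expression converts every element; no explicit Python loop is written.
fahrenheit = celsius * 9 / 5 + 32

print(celsius.dtype)       # all elements share one dtype (float64 here)
print(fahrenheit)
print(fahrenheit.mean())   # built-in mathematical functions work on the whole array
```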


# QUESTION 2 --  How does broadcasting work in NumPy?


Broadcasting is a mechanism in NumPy that allows operations to be performed on arrays of different shapes and sizes. It's a set of rules that NumPy follows to "stretch" or "replicate" the smaller array to match the shape of the larger array during arithmetic operations. This avoids the need to explicitly create larger intermediate arrays, which can save memory and improve performance.

The rules of broadcasting are as follows:

1.  **Rule 1: Equal number of dimensions:** If the two arrays have the same number of dimensions, NumPy compares their shapes dimension by dimension, starting from the trailing (rightmost) dimension. Two dimensions are compatible if they are equal or if one of them is 1.
2.  **Rule 2: Unequal number of dimensions:** If the two arrays have different numbers of dimensions, the shape of the array with fewer dimensions is padded with ones on its leading (leftmost) side until both arrays have the same number of dimensions.
3.  **Rule 3: Compatibility check:** After padding (if necessary), if all dimensions are compatible according to Rule 1, the arrays are considered broadcastable. If any dimension is incompatible, a `ValueError` is raised.

Once the arrays are broadcastable, the smaller array is conceptually "stretched" or "replicated" along the dimensions where its size is 1 to match the corresponding dimensions of the larger array. The operation is then performed element-wise on the resulting arrays.
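
The rules above can be sketched with a few small arrays. The array names are illustrative assumptions; the shapes in the comments show which rule applies in each case.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])         # shape (2, 3)
row = np.array([10, 20, 30])           # shape (3,), padded to (1, 3)
column = np.array([[100], [200]])      # shape (2, 1)

row_result = matrix + row              # row is reused for each of the 2 rows
column_result = matrix + column        # column is reused for each of the 3 columns
print(row_result)
print(column_result)

# Shapes (2, 3) and (2,) are incompatible: the trailing sizes 3 and 2 differ
# and neither is 1, so NumPy raises a ValueError.
try:
    matrix + np.array([1, 2])
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Broadcast failed:", err)
```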


# QUESTION 3 -- What is a Pandas DataFrame?

A Pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). You can think of it as a spreadsheet or a SQL table. It is one of the most commonly used data structures in data analysis and manipulation with the pandas library.
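
A short sketch of these properties follows. The column names, row labels, and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data with a different dtype in each column.
df_example = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],
        "age": [29, 35, 41],
        "score": [88.5, 92.0, 79.5],
    },
    index=["r1", "r2", "r3"],   # row labels
)

print(df_example.dtypes)               # heterogeneous: one dtype per column
print(df_example.loc["r2", "age"])     # label-based lookup by row and column

# Size-mutable: a new column can be added after creation.
df_example["passed"] = df_example["score"] >= 80
print(df_example.shape)
```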



# QUESTION 4 --  Explain the use of the groupby() method in Pandas.

The `groupby()` method in pandas is used to split a DataFrame into groups based on some criteria. This is a fundamental operation in data analysis and is often used in conjunction with aggregation functions (like `sum()`, `mean()`, `count()`, `max()`, `min()`, etc.) to perform calculations on subsets of your data.

Conceptually, the `groupby()` process involves three steps:

1.  **Splitting:** The data is split into groups based on the values in one or more columns.
2.  **Applying:** A function (usually an aggregation function) is applied to each individual group.
3.  **Combining:** The results of the applied function are combined into a new DataFrame or Series.
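
The split, apply, and combine steps can be sketched as follows. The `sales` DataFrame and its column names are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "amount": [100, 150, 200, 50, 300],
})

# Split by region, apply sum() to each group, combine into one Series.
total_by_region = sales.groupby("region")["amount"].sum()
print(total_by_region)

# Several aggregations at once are combined into a DataFrame.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)
```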



# QUESTION 5 -- Why is Seaborn preferred for statistical visualizations?

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Seaborn simplifies the process of creating visually appealing and statistically informative plots, especially when working with tabular data and exploring relationships and distributions within the data. Its built-in statistical plots, integration with Pandas, and enhanced aesthetics make it a preferred choice for many data analysts and scientists.

# QUESTION 6 -- What are the differences between NumPy arrays and Python lists?

*   **Data Type:** NumPy arrays are homogeneous (all elements are the same type), while Python lists can be heterogeneous (elements can be different types).
*   **Performance:** NumPy arrays are faster for numerical operations due to their implementation in C and optimized operations. Python lists are generally slower for these tasks.
*   **Size:** NumPy arrays have a fixed size once created, while Python lists can dynamically grow or shrink.
*   **Functionality:** NumPy arrays offer extensive mathematical functions and capabilities like broadcasting, which are not available in standard Python lists.
*   **Memory:** NumPy arrays are more memory-efficient for large numerical datasets.

In short, NumPy arrays are specialized for efficient numerical computations, while Python lists are more general-purpose data structures.
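
The differences listed above can be observed directly. This is a small sketch with made-up values; the variable names are illustrative.

```python
import numpy as np

py_list = [1, "two", 3.0]        # a list may mix element types
arr = np.array([1, 2, 3])        # an array has one dtype for all elements

repeated = [1, 2, 3] * 2         # list "multiplication" repeats the list
doubled = arr * 2                # array multiplication is element-wise
print(repeated)
print(doubled)

# Mixing ints and floats upcasts the whole array to a single dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# "Growing" an array builds a new array; the original keeps its size.
bigger = np.append(arr, 4)
print(arr.size, bigger.size)
```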

# QUESTION 7 -- What is a heatmap, and when should it be used?

A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. It's a way to visualize the magnitude of a phenomenon as color in two dimensions. The variation in color intensity or hue represents the variation in the data values.

Heatmaps are particularly useful for visualizing:

1.  **Correlation matrices:** To show the pairwise correlations between multiple variables. Strong positive or negative correlations will be represented by distinct colors, making it easy to identify relationships.
2.  **Data tables:** To display the values in a table in a visually intuitive way, especially when the table is large.
3.  **Missing data:** To visualize the pattern of missing values in a dataset.
4.  **Cluster analysis:** To visualize the results of clustering algorithms, where rows and columns are reordered based on their similarity.
5.  **Genomic data:** To visualize gene expression levels across different samples.
6.  **Website click data:** To show which areas of a webpage receive the most clicks.



# QUESTION 8 --  What does the term “vectorized operation” mean in NumPy?

In NumPy, a "vectorized operation" refers to performing mathematical operations on entire arrays at once, without explicitly looping through individual elements.

Instead of writing a `for` loop to process each element of an array, you write a single array expression, and NumPy runs the loop internally in optimized, compiled C code. This results in significantly faster execution and more concise code compared to traditional Python loops, especially for large arrays.
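
To make the contrast concrete, the sketch below computes the same squares with an explicit loop and with a vectorized expression. The names are illustrative, and the tiny array is only meant to show that both forms give the same result, not to measure speed.

```python
import numpy as np

values = np.arange(1, 6)                 # array holding 1, 2, 3, 4, 5

# Loop version: one Python-level iteration per element.
squares_loop = np.empty_like(values)
for i in range(len(values)):
    squares_loop[i] = values[i] ** 2

# Vectorized version: one expression, the loop runs inside NumPy.
squares_vec = values ** 2

print(squares_vec)
print(np.array_equal(squares_loop, squares_vec))   # both approaches agree
```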

# QUESTION 9 -- How does Matplotlib differ from Plotly?

Here are the key differences between Matplotlib and Plotly in a nutshell:

*   **Interactivity:** Plotly creates interactive plots by default (zooming, panning, hovering), while Matplotlib primarily generates static plots.
*   **Approach:** Matplotlib is low-level and highly customizable, requiring more code for complex plots. Plotly is high-level and declarative, simplifying the creation of interactive visualizations.
*   **Output:** Matplotlib mainly outputs static images. Plotly outputs interactive web-based visualizations (HTML) and can also generate static images.
*   **Ease of Use:** Plotly can be easier for creating complex interactive plots with less code.

In short, Matplotlib offers more control for static plots, while Plotly excels at creating interactive and web-friendly visualizations.

# QUESTION 10 -- What is the significance of hierarchical indexing in Pandas?


Hierarchical indexing (or MultiIndex) in Pandas allows you to have multiple levels of indexes on an axis (rows or columns) of a DataFrame or Series. Its significance lies in enabling you to work with and analyze higher-dimensional data within a 1D or 2D structure, providing a powerful way to organize, select, and reshape complex datasets, especially those with grouped or nested categories.
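
As a minimal sketch, the Series below uses a two-level index of year and quarter. The level names and revenue figures are hypothetical.

```python
import pandas as pd

# Two index levels: year (outer) and quarter (inner).
index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
revenue = pd.Series([10, 12, 15, 18], index=index, name="revenue")

print(revenue.loc["2024"])               # everything under one outer label
print(revenue.loc[("2023", "Q2")])       # a single entry

yearly = revenue.groupby(level="year").sum()   # aggregate over the inner level
wide = revenue.unstack("quarter")              # reshape: quarters become columns
print(yearly)
print(wide)
```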

# QUESTION 11 -- What is the role of Seaborn’s pairplot() function?

Seaborn's `pairplot()` function creates a grid of pairwise relationships in a dataset. It plots scatterplots for each pair of numerical variables and histograms (or kernel density estimates) on the diagonal to show the distribution of each single variable. It's a great tool for quickly visualizing relationships and distributions in your data.

# QUESTION 12 -- What is the purpose of the describe() function in Pandas? (In Short)

The `describe()` function in Pandas provides a summary of descriptive statistics for the columns of a DataFrame or Series. For numerical columns, it includes count, mean, standard deviation, minimum, maximum, and quartile values. For object (string) or categorical columns, it reports the count, the number of unique values, the most frequent value, and that value's frequency. It's useful for getting a quick overview of your data's distribution.
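
A short sketch of both cases follows, using a made-up DataFrame with one numeric and one string column.

```python
import pandas as pd

# Hypothetical sample data.
people = pd.DataFrame({
    "age": [22, 35, 35, 41, 58],
    "city": ["Pune", "Delhi", "Pune", "Pune", "Delhi"],
})

# By default, describe() on a DataFrame summarizes only the numeric columns.
numeric_summary = people.describe()
print(numeric_summary)

# On a string column it reports count, unique, top, and freq instead.
text_summary = people["city"].describe()
print(text_summary)
```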

# QUESTION 13 -- Why is handling missing data important in Pandas?

Handling missing data in Pandas is crucial because:

1.  **Accuracy:** Missing data can skew statistical analyses and lead to inaccurate conclusions or biased models.
2.  **Functionality:** Many data analysis and machine learning algorithms cannot handle missing values and will produce errors.
3.  **Data Integrity:** Addressing missing data helps maintain the integrity and quality of your dataset.

Properly handling missing data ensures reliable analysis and allows you to use your data effectively for modeling and insights.
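
The usual first steps are to count the gaps, then either drop or fill them. The sketch below assumes a hypothetical `readings` table; filling with the column mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with two missing temperatures.
readings = pd.DataFrame({
    "sensor": ["a", "b", "c", "d"],
    "temp": [20.0, np.nan, 22.0, np.nan],
})

missing_count = readings["temp"].isna().sum()    # how many values are missing
print(missing_count)

dropped = readings.dropna(subset=["temp"])       # option 1: remove incomplete rows
filled = readings.fillna({"temp": readings["temp"].mean()})  # option 2: fill with the mean
print(dropped)
print(filled)
```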

# QUESTION 14 -- What are the benefits of using Plotly for data visualization?

The main benefits of using Plotly for data visualization are:

*   **Interactivity:** Creates interactive plots by default (zooming, panning, hovering) without extra code.
*   **Web Integration:** Easily embeddable in web applications and dashboards (HTML output).
*   **High-Level Interface:** Simplifies creating complex and interactive plots with less code compared to lower-level libraries.
*   **Rich Plot Types:** Supports a wide range of plot types, including 3D plots, contour plots, and financial charts.


# QUESTION 15 -- How does NumPy handle multidimensional arrays?

NumPy handles multidimensional arrays using its core data structure called the `ndarray` (n-dimensional array). This structure efficiently stores and manipulates arrays with any number of dimensions (e.g., 1D vectors, 2D matrices, 3D tensors). NumPy provides a wide range of functions and operations that work seamlessly on these multidimensional arrays, allowing for efficient element-wise operations, slicing, reshaping, and mathematical computations across all dimensions.
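
For illustration, the sketch below builds a small 3D array and applies indexing, slicing, an axis-wise reduction, and a reshape. The name `cube` and the chosen shape are assumptions for this example.

```python
import numpy as np

# A 3D array: 2 blocks, each with 3 rows and 4 columns.
cube = np.arange(24).reshape(2, 3, 4)

print(cube.ndim, cube.shape, cube.size)   # number of axes, shape tuple, element count

last_item = cube[1, 2, 3]                 # one index per axis
first_rows = cube[:, 0, :]                # first row of every block
block_sum = cube.sum(axis=0)              # reduce across the blocks
flat_table = cube.reshape(6, 4)           # same data, different shape

print(last_item)
print(first_rows.shape, block_sum.shape, flat_table.shape)
```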

# QUESTION 16 -- What is the role of Bokeh in data visualization?


Bokeh is a Python library that specializes in creating interactive web-based visualizations. Its primary role is to provide a flexible and powerful way to build complex statistical plots and dashboards that can be easily embedded in web pages or served as standalone applications, enabling interactive data exploration and presentation in a web browser.

# QUESTION 17 -- Explain the difference between apply() and map() in Pandas.

*   **`apply()`:** Used on a DataFrame or Series to apply a function along an axis (row or column). It's more general and can apply functions that operate on the entire row or column.
*   **`map()`:** A Series method that transforms each value element-wise, using a function, a dictionary, or another Series as the mapping. (Since pandas 2.1, DataFrame also has an element-wise `map()`, which replaces the older `applymap()`.)

In short, `map()` is for element-wise substitution on a Series, while `apply()` is for applying functions along an axis of a DataFrame or Series.
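
The difference shows up clearly on a small example. The `grades` DataFrame and the pass mark of 60 are hypothetical.

```python
import pandas as pd

# Hypothetical grades table.
grades = pd.DataFrame({"math": [90, 55, 72], "art": [65, 80, 40]})

# map(): element-wise on a single Series, with a function...
level = grades["math"].map(lambda s: "pass" if s >= 60 else "fail")
# ...or with a dictionary used as a lookup table.
codes = pd.Series(["a", "b", "a"]).map({"a": "alpha", "b": "beta"})

# apply(): the function receives a whole column (default) or a whole row (axis=1).
col_range = grades.apply(lambda col: col.max() - col.min())
row_total = grades.apply(lambda row: row["math"] + row["art"], axis=1)

print(level.tolist(), codes.tolist())
print(col_range)
print(row_total)
```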

# QUESTION 18 -- What are some advanced features of NumPy?

Some advanced features of NumPy include:

*   **Broadcasting:** Performing operations on arrays of different shapes.
*   **Linear Algebra:** Functions for matrix operations, eigenvalues, eigenvectors, etc.
*   **Fourier Transforms:** Functions for analyzing frequencies in data.
*   **Random Number Generation:** Sophisticated tools for generating random numbers from various distributions.
*   **Masked Arrays:** Handling missing or invalid data.
*   **Memory Mapping:** Working with large arrays stored on disk without loading the entire array into memory.
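
A few of these features can be sketched in a handful of lines. The inputs are made up for illustration, and memory mapping is left out because it needs a file on disk.

```python
import numpy as np

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)
print(x)

# Fourier transform: a sine wave with one cycle over 8 samples
# has its strongest component in frequency bin 1.
t = np.arange(8)
signal = np.sin(2 * np.pi * t / 8)
spectrum = np.abs(np.fft.rfft(signal))
print(spectrum.argmax())

# Random numbers: a seeded Generator gives reproducible draws.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)
print(samples.shape)

# Masked arrays: invalid entries are excluded from computations.
data = np.ma.masked_invalid([1.0, np.nan, 3.0])
print(data.mean())
```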

# QUESTION 19 --  How does Pandas simplify time series analysis?

Pandas simplifies time series analysis by providing:

*   **DatetimeIndex:** A specialized index for working with dates and times, enabling efficient indexing, slicing, and resampling of time series data.
*   **Time-aware data structures:** Data structures like Series and DataFrames are designed to handle time series data effectively.
*   **Built-in functions:** Functions for handling time zones, calculating time differences, resampling data at different frequencies (e.g., daily to monthly), and performing rolling window calculations.
*   **Integration with plotting libraries:** Easy integration with libraries like Matplotlib and Seaborn for visualizing time series data.

These features make it much easier to manipulate, analyze, and visualize time-stamped data compared to using standard Python lists or arrays.
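
A minimal sketch of date-based slicing, resampling, and rolling windows follows. The dates and values are hypothetical, and the two-day frequency is chosen only to keep the example small.

```python
import pandas as pd

# Six daily observations indexed by a DatetimeIndex.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Slice by date strings; both endpoints are included.
window = ts.loc["2024-01-02":"2024-01-04"]
print(window)

# Resample to two-day bins and sum each bin.
two_day = ts.resample("2D").sum()
print(two_day)

# Rolling mean over 3 observations; the first two results are NaN.
rolling = ts.rolling(window=3).mean()
print(rolling)
```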

# QUESTION 20 -- What is the role of a pivot table in Pandas?

The role of a pivot table in Pandas is to summarize and reorganize data from a DataFrame. It uses the values of chosen columns as the row index and column headers of a new DataFrame, and aggregates the remaining values (with the mean by default) for each combination. This is useful for analyzing relationships between different variables and creating summary tables, similar to pivot tables in spreadsheet software.
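
As an illustration, the sketch below totals units per region and product with `pivot_table()`. The `orders` data and column names are assumptions for this example.

```python
import pandas as pd

# Hypothetical order records in "long" form.
orders = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "product": ["pen", "ink", "pen", "ink", "pen"],
    "units": [10, 4, 7, 3, 5],
})

# Regions become rows, products become columns, units are summed per cell.
table = orders.pivot_table(index="region", columns="product",
                           values="units", aggfunc="sum")
print(table)
```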

# QUESTION 21 -- Why is NumPy’s array slicing faster than Python’s list slicing?

NumPy array slicing is faster than Python list slicing primarily because:

1.  **Views Instead of Copies:** Basic slicing of a NumPy array returns a view onto the existing data buffer, so no elements are copied. Slicing a list creates a new list and copies a reference for every element in the slice.
2.  **Contiguous Memory Allocation:** NumPy arrays typically store elements in a contiguous block of memory, so a slice can be described by an offset, a shape, and strides. Python lists store pointers to objects scattered throughout memory, which adds overhead when elements are accessed.
3.  **Fixed Data Type:** NumPy arrays are homogeneous (elements are of the same data type), so operations on a slice run without inspecting the type of each element. Python lists can hold heterogeneous data types, so each element must be handled as a generic Python object.
4.  **Compiled Array Operations:** NumPy's array operations run in optimized C loops over raw values. CPython lists are also implemented in C, but they operate on boxed Python objects, which carries more per-element overhead.

In essence, NumPy's design for numerical operations, in particular view-based slicing over typed memory, leads to significant performance advantages for array slicing compared to the more general-purpose Python lists.
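
One further factor is worth illustrating: basic slicing of a NumPy array returns a view that shares memory with the original, whereas slicing a list builds a new, independent list. The sketch below shows the observable consequence; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(10)
lst = list(range(10))

arr_slice = arr[2:5]      # a view: no element data is copied
lst_slice = lst[2:5]      # a new list holding copied references

arr_slice[0] = 99         # writes through to the original array
lst_slice[0] = 99         # changes only the new list

print(arr[2])                              # the original array was modified
print(lst[2])                              # the original list was not
print(np.shares_memory(arr, arr_slice))    # the view shares the array's buffer

# When an independent copy is needed, request one explicitly.
independent = arr[2:5].copy()
print(np.shares_memory(arr, independent))
```

Because a view shares data with its source, call `.copy()` before modifying a slice whenever the original array must stay unchanged.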


# QUESTION 22 -- What are some common use cases for Seaborn?

Some common use cases for Seaborn include:

*   **Exploratory Data Analysis (EDA):** Visualizing distributions of variables and relationships between them.
*   **Statistical Data Visualization:** Creating plots that show statistical relationships and model fits.
*   **Visualizing Categorical Data:** Generating plots like box plots and bar plots to compare data across categories.
*   **Plotting Time Series Data:** Enhancing aesthetics and providing statistical summaries for time series plots.
*   **Creating Complex Multi-Panel Plots:** Easily generating grids of plots based on data subsets.
*   **Visualizing Distributions:** Creating histograms and density plots to understand single variable distributions.

# **PRACTICAL QUESTIONS**

# QUESTION 1 --  How do you create a 2D NumPy array and calculate the sum of each row?

In [None]:
import numpy as np

two_d_array = np.array([[1, 2, 3],
                        [4, 5, 6],
                        [7, 8, 9]])

print("2D NumPy Array:")
print(two_d_array)

# axis=1 adds up the values across the columns of each row.
row_sums = np.sum(two_d_array, axis=1)

print("\nSum of each row:")
print(row_sums)


row_sums_method = two_d_array.sum(axis=1)

print("\nSum of each row (using .sum() method):")
print(row_sums_method)

2D NumPy Array:
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Sum of each row:
[ 6 15 24]

Sum of each row (using .sum() method):
[ 6 15 24]


# QUESTION 2 -- Write a Pandas script to find the mean of a specific column in a DataFrame.

In [None]:
import pandas as pd

data = {'Column1': [10, 20, 15, 25, 30],
        'Column2': [1, 2, 3, 4, 5],
        'Column3': [100, 200, 150, 250, 300]}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

column_name = 'Column1'

column_mean = df[column_name].mean()

print(f"\nMean of the '{column_name}' column: {column_mean}")

another_column_name = 'Column3'
another_column_mean = df[another_column_name].mean()

print(f"Mean of the '{another_column_name}' column: {another_column_mean}")

Original DataFrame:
   Column1  Column2  Column3
0       10        1      100
1       20        2      200
2       15        3      150
3       25        4      250
4       30        5      300

Mean of the 'Column1' column: 20.0
Mean of the 'Column3' column: 200.0


# QUESTION 3 --  Create a scatter plot using Matplotlib.

In [None]:
import matplotlib.pyplot as plt

# Reuses the DataFrame `df` created in Question 2 above.
plt.scatter(df['Column1'], df['Column3'])
plt.xlabel('Column1')
plt.ylabel('Column3')
plt.title('Scatter Plot of Column1 vs Column3')
plt.show()

# QUESTION 4 --  How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

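# pandas computes the correlation matrix; Seaborn is used only to draw it.
# Reuses the all-numeric DataFrame `df` created in Question 2 above.
correlation_matrix = df.corr()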

print("Correlation Matrix:")
print(correlation_matrix)
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix of the DataFrame')
plt.show()

# QUESTION 5 --  Generate a bar plot using Plotly.

In [None]:
import plotly.express as px

# Reuses the DataFrame `df` created in Question 2 above.

fig = px.bar(df, x='Column1', y='Column3', title='Bar Plot of Column1 vs Column3')
fig.show()

# QUESTION 6 --  Create a DataFrame and add a new column based on an existing column.

In [None]:
import pandas as pd

data = {'A': [10, 20, 30, 40, 50],
        'B': [1, 2, 3, 4, 5],
        'C': ['X', 'Y', 'X', 'Y', 'X']}
df_new_column = pd.DataFrame(data)

print("Original DataFrame:")
print(df_new_column)
df_new_column['D'] = df_new_column['A'] * 2
df_new_column['E'] = df_new_column['A'].apply(lambda x: 'High' if x > 30 else 'Low')


print("\nDataFrame with new columns 'D' and 'E':")
print(df_new_column)

# QUESTION 7 -- Write a program to perform element-wise multiplication of two NumPy arrays.

In [None]:
import numpy as np
array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])

print("Array 1:")
print(array1)
print("\nArray 2:")
print(array2)

result_array = array1 * array2

print("\nResult of element-wise multiplication:")
print(result_array)
array3 = np.array([[1, 2], [3, 4]])
array4 = np.array([[5, 6], [7, 8]])

print("\nArray 3:")
print(array3)
print("\nArray 4:")
print(array4)

result_multi = array3 * array4

print("\nResult of element-wise multiplication (multidimensional):")
print(result_multi)

# QUESTION 8 -- Create a line plot with multiple lines using Matplotlib.

In [None]:
import matplotlib.pyplot as plt

# Reuses the DataFrame `df` created in Question 2 above.
plt.figure(figsize=(10, 6))

plt.plot(df.index, df['Column1'], label='Column1')
plt.plot(df.index, df['Column2'], label='Column2')
plt.plot(df.index, df['Column3'], label='Column3')
plt.xlabel('Index')
plt.ylabel('Values')
plt.title('Line Plot of Multiple Columns')
plt.legend()

plt.grid(True)

plt.show()

# QUESTION 9 --  Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.

In [None]:
import pandas as pd

data = {'Category': ['A', 'B', 'A', 'C', 'B', 'C', 'A'],
        'Value': [10, 25, 12, 35, 20, 40, 15],
        'ID': [101, 102, 103, 104, 105, 106, 107]}
df_filter = pd.DataFrame(data)

print("Original DataFrame:")
display(df_filter)

threshold = 20
filtered_df = df_filter[df_filter['Value'] > threshold]

print(f"\nDataFrame filtered where 'Value' is greater than {threshold}:")
display(filtered_df)

filtered_by_category = df_filter[df_filter['Category'] == 'A']

print("\nDataFrame filtered where 'Category' is 'A':")
display(filtered_by_category)

# QUESTION 10 --  Create a histogram using Seaborn to visualize a distribution

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Reuses the DataFrame `df_filter` created in Question 9 above.

plt.figure(figsize=(8, 6))
sns.histplot(data=df_filter, x='Value', kde=True)


plt.title('Distribution of Values (Histogram)')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

# QUESTION 11 --  Perform matrix multiplication using NumPy.

In [None]:
import numpy as np

matrix1 = np.array([[1, 2],
                    [3, 4]])

matrix2 = np.array([[5, 6],
                    [7, 8]])

print("Matrix 1:")
print(matrix1)
print("\nMatrix 2:")
print(matrix2)

result_matrix_at = matrix1 @ matrix2

print("\nResult of matrix multiplication (using @ operator):")
print(result_matrix_at)


result_matrix_dot = np.dot(matrix1, matrix2)

print("\nResult of matrix multiplication (using np.dot()):")
print(result_matrix_dot)


matrix3 = np.array([[1, 2, 3],
                    [4, 5, 6]])

matrix4 = np.array([[7, 8],
                    [9, 10],
                    [11, 12]])

print("\nMatrix 3:")
print(matrix3)
print("\nMatrix 4:")
print(matrix4)

result_different_dims = matrix3 @ matrix4

print("\nResult of matrix multiplication (2x3 @ 3x2):")
print(result_different_dims)

# QUESTION 12 -- Use Pandas to load a CSV file and display its first 5 rows.

In [None]:
import pandas as pd


csv_file_path = 'your_file.csv'

try:

    df_csv = pd.read_csv(csv_file_path)

    print(f"Successfully loaded data from '{csv_file_path}'")
    print("\nFirst 5 rows of the DataFrame:")
    display(df_csv.head())

except FileNotFoundError:
    print(f"Error: The file '{csv_file_path}' was not found.")
    print("Please make sure the file exists and the path is correct.")
except Exception as e:
    print(f"An error occurred while reading the CSV file: {e}")

# QUESTION 13 -- Create a 3D scatter plot using Plotly.


In [None]:
import plotly.express as px
import pandas as pd
import numpy as np

np.random.seed(42)
data_3d = {
    'x': np.random.rand(50) * 10,
    'y': np.random.rand(50) * 10,
    'z': np.random.rand(50) * 10,
    'Category': np.random.choice(['A', 'B', 'C'], 50)
}
df_3d = pd.DataFrame(data_3d)

print("Sample DataFrame for 3D scatter plot:")
display(df_3d.head())


fig = px.scatter_3d(df_3d,
                    x='x',
                    y='y',
                    z='z',
                    color='Category',
                    title='3D Scatter Plot Example')


fig.show()