In [None]:
#Q1. What is NumPy, and why is it widely used in Python?

NumPy (Numerical Python) is a fundamental library for scientific computing in Python. It provides a high-performance multidimensional array object (the ndarray) and tools for working with these arrays. It is widely used because, for numerical operations, its arrays are typically much faster and more memory-efficient than standard Python lists. Many other data science libraries, such as Pandas and Matplotlib, build on NumPy arrays.

#Q2. How does broadcasting work in NumPy?

Broadcasting lets NumPy perform arithmetic on arrays of different shapes. Shapes are compared from the trailing dimension backward, and two dimensions are compatible when they are equal or one of them is 1. The smaller array is then treated as if it were stretched to match the larger one, without actually copying its data, which keeps the operation efficient. For example, adding a single number to an array adds that number to every element. If the shapes are not compatible, NumPy raises a ValueError.
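
A minimal sketch of broadcasting, assuming NumPy is installed. The array names (matrix, row, col) are made up for illustration:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)
col = np.array([[100], [200]])      # shape (2, 1)

plus_scalar = matrix + 1            # the scalar is applied to every element
plus_row = matrix + row             # the row is reused for each of the 2 rows
plus_col = matrix + col             # the column is reused for each of the 3 columns

print(plus_scalar)
print(plus_row)
print(plus_col)

# Shapes (2, 3) and (2,) are not compatible: the trailing dimensions are 3 and 2
try:
    matrix + np.array([1, 2])
except ValueError as exc:
    print("Broadcast failed:", exc)
```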

#Q3. What is a Pandas DataFrame?

A Pandas DataFrame is a two-dimensional, size-mutable tabular data structure with labeled axes (rows and columns), in which each column can hold a different data type. Think of it like a spreadsheet or a SQL table. It is the most commonly used object in Pandas for storing and manipulating data.

#Q4. Explain the use of the groupby() method in Pandas.

The groupby() method splits a DataFrame into groups based on some criterion, such as the values in one or more columns. It is usually combined with an aggregation function that computes a result for each group. The process typically involves three steps:
1. Splitting the data into groups.
2. Applying a function (like sum(), mean(), or count()) to each group.
3. Combining the results into a single object.
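
The split-apply-combine steps above can be sketched as follows. The sales DataFrame and its column names are hypothetical sample data:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "units": [10, 5, 7, 3, 8],
})

# Split by region, apply sum() to each group, combine into one Series
units_per_region = sales.groupby("region")["units"].sum()
print(units_per_region)

# Several aggregations per group in a single call
summary = sales.groupby("region")["units"].agg(["sum", "mean", "count"])
print(summary)
```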

#Q5. Why is Seaborn preferred for statistical visualizations?

Seaborn is a library built on top of Matplotlib that specializes in attractive and informative statistical graphics. It has a high-level interface for drawing complex plots like heatmaps and violin plots. It is preferred because it simplifies the creation of statistical visualizations, handles common plotting tasks automatically (like mapping data columns to visual properties such as color and size), and ships with built-in themes for a cleaner look.

#Q6. What are the differences between NumPy arrays and Python lists?

The main differences are:
Data Type: NumPy arrays are homogeneous, meaning all elements must be of the same data type. Python lists are heterogeneous and can contain elements of different types.
Performance: NumPy arrays are significantly faster and more memory-efficient for numerical operations due to their fixed data type and C implementation.
Functionality: NumPy arrays are optimized for numerical operations and provide a wide range of mathematical functions. Python lists are more general-purpose data containers.
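
A short sketch of the data type and functionality differences, assuming NumPy is installed. The variable names are illustrative only:

```python
import numpy as np

py_list = [1, 2.5, "three"]      # a list can mix types freely
arr = np.array([1, 2, 3])        # an array has one dtype for all elements

print(arr.dtype)                 # an integer dtype (the exact width depends on the platform)

mixed = np.array([1, 2.5])       # the int is upcast so every element is a float
print(mixed.dtype)

# The same operator behaves differently on the two containers
doubled_list = [1, 2, 3] * 2     # list repetition
doubled_arr = arr * 2            # element-wise multiplication
print(doubled_list)
print(doubled_arr)
```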

#Q7. What is a heatmap, and when should it be used?

A heatmap is a graphical representation of data where values are depicted by color. It's a great tool for visualizing a two-dimensional matrix of data, such as a correlation matrix. You should use a heatmap when you want to quickly see patterns, such as which variables are most correlated, which is a common task in exploratory data analysis.

#Q8. What does the term "vectorized operation" mean in NumPy?

A vectorized operation refers to applying a function or operation to every element in an array without using an explicit for loop. Instead, the operation is performed on the entire array at once. This approach is much more efficient because it leverages low-level C code, avoiding the overhead of the Python interpreter, leading to significant performance gains.
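
As a small illustration, the snippet below computes the same squares with an explicit loop and with a vectorized expression. It is a sketch with made-up names and does not measure timing:

```python
import numpy as np

values = np.arange(1, 6)                 # array([1, 2, 3, 4, 5])

# Explicit Python loop: one interpreter-level step per element
squares_loop = np.empty_like(values)
for i in range(len(values)):
    squares_loop[i] = values[i] ** 2

# Vectorized: one expression, the per-element loop runs in compiled code
squares_vec = values ** 2

print(squares_vec)
print(np.array_equal(squares_loop, squares_vec))
```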

#Q9. How does Matplotlib differ from Plotly?

Matplotlib is a low-level, highly customizable plotting library. It gives you fine-grained control over every aspect of a plot, but often requires more code to create a complex visualization. Plotly, on the other hand, is a high-level library that specializes in creating interactive, web-based visualizations. Its plots can be easily embedded in web applications and support features like hovering over data points for more information.

#Q10. What is the significance of hierarchical indexing in Pandas?

Hierarchical indexing (also known as MultiIndex) allows you to have multiple levels of row and/or column labels on a Pandas DataFrame or Series.
Significance:
Handling Complex Data: It suits data with natural hierarchical relationships, such as records keyed by year, month, and day, or data joined from several tables of a relational database.
Data Organization: It provides a structured way to organize and slice higher-dimensional data within a 2D structure.
Simplified Operations: You can select, group, and aggregate at any level of the hierarchy without first creating helper columns.
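
A minimal sketch of a two-level index, assuming Pandas is installed. The level names (year, quarter) and the revenue figures are invented for the example:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(2023, "Q1"), (2023, "Q2"), (2024, "Q1"), (2024, "Q2")],
    names=["year", "quarter"],
)
revenue = pd.Series([100, 120, 130, 150], index=index, name="revenue")

rows_2023 = revenue.loc[2023]                  # all quarters under the outer label 2023
single = revenue.loc[(2024, "Q2")]             # one (year, quarter) entry
yearly = revenue.groupby(level="year").sum()   # aggregate at the 'year' level

print(rows_2023)
print(single)
print(yearly)
```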

#Q11. What is the role of Seaborn's pairplot() function?

The pairplot() function creates a grid of plots showing the relationships between all pairs of numerical variables in a dataset.
Diagonal Plots: The plots on the diagonal show the distribution of each individual variable (typically a histogram or KDE plot).
Off-Diagonal Plots: The plots off the diagonal are scatter plots of each pair of variables, which help to visualize correlations and relationships.
It is a useful tool for exploratory data analysis (EDA), making it quick to spot potential correlations or patterns in a dataset.

#Q12. What is the purpose of the describe() function in Pandas?

The describe() function generates descriptive statistics for a DataFrame or Series. By default, on a DataFrame with mixed column types, it summarizes only the numerical columns, reporting:
count: The number of non-null values.
mean: The average value.
std: The standard deviation.
min, max: The minimum and maximum values.
25%, 50%, 75%: The quartiles (25th, 50th/median, and 75th percentiles).
For non-numerical columns (for example, when called with include='all' or directly on such a column), it reports a different set of statistics: count, unique, top, and freq. Its purpose is to give a quick summary of the central tendency, dispersion, and shape of a dataset's distribution.
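
A brief sketch of both kinds of summary, using a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, 40],
    "city": ["NY", "LA", "NY", "SF"],
})

numeric_summary = df.describe()          # only the numeric column 'age' is summarized
print(numeric_summary)

text_summary = df["city"].describe()     # count, unique, top, freq for a text column
print(text_summary)
```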

#Q13. Why is handling missing data important in Pandas?

Missing data (represented as NaN or NaT in Pandas) is a common issue in real-world datasets. Handling it is crucial for several reasons:
Data Integrity: Missing values can lead to inaccurate or misleading analysis. For example, Pandas skips NaN by default when computing a mean, so the result describes only the observed rows, which may not be representative of the whole dataset.
Model Performance: Many machine learning algorithms cannot handle missing values and will either fail or produce poor results.
Statistical Bias: Ignoring missing values can introduce bias into your analysis, especially if the missingness is not random.
Common techniques for handling missing data include dropping rows or columns with missing values, filling them with a value (e.g., the mean, the median, or a constant), or using more advanced imputation methods.
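
The basic techniques can be sketched like this. The DataFrame is invented sample data, and mean imputation is shown only as one simple option:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [80.0, np.nan, 90.0, np.nan],
})

missing_count = df["score"].isna().sum()   # how many values are missing
observed_mean = df["score"].mean()         # NaN is skipped by default (skipna=True)

dropped = df.dropna(subset=["score"])      # option 1: drop rows with a missing score
filled = df["score"].fillna(observed_mean) # option 2: fill with the observed mean

print(missing_count, observed_mean)
print(dropped)
print(filled)
```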

#Q14. What are the benefits of using Plotly for data visualization?

Interactivity: Plotly plots are interactive out of the box. Users can zoom, pan, hover over data points to see details, and toggle traces on/off.
Web-Based: Plots are rendered in a browser, making them easy to share as HTML files and well suited to web-based dashboards and applications.
Rich Plot Types: Supports a wide variety of 2D and 3D plot types, including scatter plots, bar charts, heatmaps, box plots, and more.
Language Agnostic: Plotly has libraries for multiple languages (Python, R, Julia), so similar figures can be produced from different languages.
Dash Integration: Plotly figures are the charting layer of Dash, a framework for building analytical web applications.

#Q15. How does NumPy handle multidimensional arrays?

NumPy's core data structure, the ndarray, is specifically designed to handle multidimensional arrays (tensors) efficiently.
Contiguous Memory: NumPy stores array elements in a contiguous block of memory, which allows for fast, low-level access and manipulation.
Shape and Strides: The ndarray object stores metadata like its shape (a tuple of dimensions) and strides (the number of bytes to skip to get to the next element in each dimension), which allows it to interpret the 1D block of memory as a multidimensional array.
Slicing and Indexing: It provides powerful and intuitive multidimensional slicing and indexing capabilities.

#Q16. What is the role of Bokeh in data visualization?

Bokeh is a Python library for creating interactive visualizations for modern web browsers. Its primary role is to build interactive plots, dashboards, and data applications, especially for large datasets.
Key features:
Interactive and Dynamic: Like Plotly, Bokeh's plots are interactive and can be embedded in web applications.
Streaming Data: It is particularly well-suited for streaming data and real-time visualization.
Server Component: Bokeh has a server component that allows for building sophisticated applications with custom user interface elements.
While it shares similarities with Plotly, it's often considered a more "developer-centric" tool for building custom, data-driven web applications from the ground up.

#Q17. Explain the difference between apply() and map() in Pandas.

map(): This method is used on a Series to substitute each value with another value. It is useful for element-wise transformations and accepts a function, a dictionary, or another Series as the mapping. With a dictionary, values that have no matching key become NaN.
apply(): This is a more general-purpose method that works on both a Series and a DataFrame. On a DataFrame, it applies a function (built-in or custom) along an axis, so the function receives a whole column (axis=0) or a whole row (axis=1) at a time. It is more flexible than map() and can handle more complex operations.
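
A small sketch contrasting the two methods. The DataFrame, its columns, and the grade-to-points mapping are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "grade": ["A", "B", "A"],
    "math": [90, 70, 85],
    "art": [80, 60, 95],
})

# map(): element-wise substitution on a single Series, here using a dictionary
points = df["grade"].map({"A": 4, "B": 3})

# apply() with axis=1: the function receives one row at a time
row_total = df[["math", "art"]].apply(lambda row: row["math"] + row["art"], axis=1)

# apply() with the default axis=0: the function receives one column at a time
col_max = df[["math", "art"]].apply(lambda col: col.max())

print(points)
print(row_total)
print(col_max)
```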

#Q18. What are some advanced features of NumPy?

Universal Functions (ufuncs): These are functions that operate on ndarrays in an element-by-element fashion. They are highly optimized and are the basis of many vectorized operations.
Broadcasting: The ability to operate on arrays of different shapes (see Q2).
Linear Algebra and Fourier Transforms: The numpy.linalg and numpy.fft modules provide powerful tools for complex mathematical operations.
Random Number Generation: The numpy.random module provides a wide range of functions for generating random numbers from various distributions.
Memory-mapped files: With numpy.memmap, NumPy can work with large datasets stored on disk without loading the entire file into memory.
Structured Arrays: Allow for creating heterogeneous arrays with named fields, similar to C structures.

#Q19. How does Pandas simplify time series analysis?

Pandas has a rich set of tools specifically for working with time-series data:
DateTime Index: It has a dedicated DatetimeIndex that simplifies indexing, slicing, and resampling data based on time.
Resampling: The .resample() method allows you to convert the frequency of time-series data (e.g., from daily to monthly data) with built-in aggregation methods.
Time-based Slicing: When a DataFrame has a DatetimeIndex, you can slice it by date strings, for example df.loc['2023-01':'2023-06'].
Rolling/Expanding Windows: The .rolling() and .expanding() methods calculate statistics over a moving or growing window of data, which is the basis of moving averages and similar time-series calculations.
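
These tools can be sketched on a synthetic daily series. The data below is generated for illustration: 90 consecutive days starting on 2023-01-01 with the values 0 to 89:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2023-01-01", periods=90, freq="D")
ts = pd.Series(np.arange(90, dtype=float), index=dates)

monthly_mean = ts.resample("MS").mean()     # daily -> monthly (month-start labels)
february = ts.loc["2023-02"]                # time-based selection by date string
rolling_7d = ts.rolling(window=7).mean()    # 7-day moving average

print(monthly_mean)
print(len(february))
print(rolling_7d.head(8))
```

The first six values of rolling_7d are NaN because a full 7-day window is not yet available.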

#Q20. What is the role of a pivot table in Pandas?

A pivot table reshapes a DataFrame by aggregating data based on the unique values in specified columns. It is similar to the pivot table functionality in spreadsheet software like Excel.
Role:
Summarization: It summarizes data in a new table whose row and column labels come from the chosen columns. Passing several columns produces a multi-level index on that axis.
Data Aggregation: It allows you to quickly group a large dataset and apply aggregation functions (e.g., sum, mean, count).
Cross-tabulation: It helps in creating cross-tabulation summaries to analyze relationships between two or more variables.
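
A minimal sketch using pivot_table() on hypothetical sales data:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "product": ["A", "B", "A", "B", "A"],
    "units": [10, 4, 6, 8, 5],
})

# Rows come from 'region', columns from 'product', cells hold the summed units
table = sales.pivot_table(index="region", columns="product",
                          values="units", aggfunc="sum")
print(table)
```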

#Q21. Why is NumPy's array slicing faster than Python's list slicing?

NumPy basic slicing (start:stop:step) returns a "view" of the original array, not a copy. No element data is copied: the view is a new array object that points into the original array's memory, with its own shape and strides. Creating it therefore takes roughly the same time regardless of the slice length. Note that this applies to basic slices only; indexing with a list of integers or a boolean mask returns a copy.
Python list slicing, in contrast, creates a new list object and copies the references to the selected elements into it. This involves memory allocation and copying proportional to the slice length, which is more expensive, especially for large lists.
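
The difference can be checked directly. This is a small sketch with made-up variable names, and the stride values assume the 8-byte int64 dtype set explicitly below:

```python
import numpy as np

arr = np.arange(10)
view = arr[2:6]                       # basic slice: a view, no data copied
view[0] = 99                          # writing through the view changes arr
shares = np.shares_memory(arr, view)

fancy = arr[[2, 3]]                   # integer-list indexing: a copy
fancy_shares = np.shares_memory(arr, fancy)

lst = list(range(10))
part = lst[2:6]                       # list slice: a new list
part[0] = 99                          # lst is unaffected

# A view only needs new shape/stride metadata over the same memory
grid = np.arange(12, dtype=np.int64).reshape(3, 4)
every_other_col = grid[:, ::2]

print(arr[2], shares, fancy_shares, lst[2])
print(grid.strides, every_other_col.strides)
```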

#Q22. What are some common use cases for Seaborn?

Exploratory Data Analysis (EDA): pairplot(), jointplot(), and histplot() are great for a quick visual overview of a dataset.
Distribution Visualization: histplot(), kdeplot() (Kernel Density Estimate), and violinplot() are used to visualize the distribution of a single variable or compare distributions across categories.
Relational Plots: scatterplot() and lineplot() are used to show relationships between variables.
Statistical Analysis Visualization: regplot() draws a scatter plot with a fitted regression line on a single set of axes, and lmplot() does the same across a grid of subplots split by categorical variables.
Categorical Data Visualization: boxplot(), stripplot(), and barplot() are used to compare numerical data across different categories.
Matrix Plots: heatmap() is a key tool for visualizing correlation matrices.

In [None]:
#Q1. Create a 2D NumPy array and calculate the sum of each row

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
row_sums = arr.sum(axis=1)
print("Array:\n", arr)
print("Row sums:", row_sums)

In [1]:
#Q2. Pandas script to find the mean of a specific column

import pandas as pd

df = pd.DataFrame({
    'A': [10, 20, 30],
    'B': [15, 25, 35]
})

mean_B = df['B'].mean()
print("Mean of column B:", mean_B)



In [2]:
#Q3. Scatter plot using Matplotlib

import matplotlib.pyplot as plt

x = [5, 7, 8, 7, 9, 12, 15]
y = [99, 86, 87, 88, 100, 86, 103]

plt.scatter(x, y, color='blue')
plt.xlabel("X values")
plt.ylabel("Y values")
plt.title("Scatter Plot")
plt.show()



In [13]:
#Q4. Correlation matrix using Seaborn (heatmap)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create a sample DataFrame with numerical data
data = {'A': [10, 20, 30, 40, 50],
        'B': [15, 25, 35, 45, 55],
        'C': [50, 40, 30, 20, 10]}
df = pd.DataFrame(data)

# Calculate the correlation matrix
correlation_matrix = df.corr()

# Create the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix Heatmap using Seaborn')
plt.show()


In [14]:
#Q5. Bar plot using Plotly

import plotly.express as px
import pandas as pd

# Sample data
data = {'Category': ['A', 'B', 'C', 'D'],
        'Value': [23, 45, 12, 34]}
df = pd.DataFrame(data)

# Create the bar plot
fig = px.bar(df, x='Category', y='Value', title='Bar Plot using Plotly')
fig.show()


In [15]:
#Q6. Create DataFrame and add new column from existing

import pandas as pd

# Create a sample DataFrame
data = {'Product': ['Laptop', 'Mouse', 'Keyboard'],
        'Price': [1200, 25, 75]}
df = pd.DataFrame(data)

# Add a new column 'Price_with_Tax' (assuming 8% tax)
df['Price_with_Tax'] = df['Price'] * 1.08

print("Original DataFrame:")
print(df[['Product', 'Price']])
print("\nDataFrame with new column:")
print(df)


In [16]:
#Q7. Element-wise multiplication of two NumPy arrays

import numpy as np

# Create two NumPy arrays
arr1 = np.array([[1, 2, 3],
                 [4, 5, 6]])

arr2 = np.array([[10, 20, 30],
                 [40, 50, 60]])

# Perform element-wise multiplication
result = arr1 * arr2

print("Array 1:")
print(arr1)
print("\nArray 2:")
print(arr2)
print("\nElement-wise multiplication result:")
print(result)


In [17]:
#Q8. Line plot with multiple lines using Matplotlib

import matplotlib.pyplot as plt
import numpy as np

# Sample data
x = np.arange(1, 11)
y1 = x * 2
y2 = x * 3 - 5

# Create the line plot
plt.figure(figsize=(8, 6))
plt.plot(x, y1, label='Line 1', marker='o')
plt.plot(x, y2, label='Line 2', marker='x')
plt.title('Line Plot with Multiple Lines')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.grid(True)
plt.show()


In [18]:
#Q9. Filter rows in Pandas DataFrame where value > threshold

import pandas as pd

# Create a sample DataFrame
data = {'Item': ['Apple', 'Banana', 'Orange', 'Grape'],
        'Price': [1.2, 0.5, 1.8, 2.5]}
df = pd.DataFrame(data)

# Filter rows where the 'Price' is greater than 1.5
filtered_df = df[df['Price'] > 1.5]

print("Original DataFrame:")
print(df)
print("\nFiltered DataFrame (Price > 1.5):")
print(filtered_df)


In [19]:
#Q10. Histogram using Seaborn

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Generate random data (seeded so the histogram is reproducible)
np.random.seed(42)
data = np.random.normal(loc=50, scale=10, size=1000)

# Create the histogram
plt.figure(figsize=(8, 6))
sns.histplot(data, bins=30, kde=True)
plt.title('Histogram of a Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()


In [20]:
#Q11. Matrix multiplication using NumPy

import numpy as np

# Create two matrices
matrix1 = np.array([[1, 2],
                    [3, 4]])

matrix2 = np.array([[5, 6],
                    [7, 8]])

# Perform matrix multiplication using the @ operator
result_at = matrix1 @ matrix2

# Perform matrix multiplication using np.dot()
result_dot = np.dot(matrix1, matrix2)

print("Matrix 1:")
print(matrix1)
print("\nMatrix 2:")
print(matrix2)
print("\nMatrix multiplication result (@ operator):")
print(result_at)
print("\nnp.dot() gives the same result:", np.array_equal(result_at, result_dot))


In [21]:
#Q12. Load CSV in Pandas and display first 5 rows

import pandas as pd
import io

# Create a sample CSV data string to simulate a file
csv_data = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
David,40,Houston
Eve,28,Phoenix
Frank,32,Philadelphia
"""
csv_file = io.StringIO(csv_data)

# Load the CSV file into a DataFrame
df = pd.read_csv(csv_file)

# Display the first 5 rows
print("The first 5 rows of the DataFrame:")
print(df.head())


In [22]:
#Q13. 3D Scatter plot using Plotly

import plotly.express as px
import pandas as pd
import numpy as np

# Sample 3D data
np.random.seed(42)
df = pd.DataFrame({
    'x': np.random.rand(100) * 10,
    'y': np.random.rand(100) * 10,
    'z': np.random.rand(100) * 10,
    'size': np.random.rand(100) * 15,
    'color': np.random.randint(0, 5, 100)
})

# Create the 3D scatter plot
fig = px.scatter_3d(df, x='x', y='y', z='z',
                    color='color', size='size',
                    title='3D Scatter Plot using Plotly')
fig.show()
