1. **What is NumPy, and why is it widely used in Python**

NumPy is a Python library for numerical computing, built for working with large collections of numbers.

Think of it like a super-fast spreadsheet for Python.

Instead of regular Python lists, which are slow for big calculations, NumPy uses an n-dimensional array (or ndarray). An ndarray stores values of a single data type in a compact block of memory, which makes math on them easy and quick.

For example, applying + to two Python lists joins them end to end, so adding them number by number needs a loop. NumPy, on the other hand, adds entire arrays element by element in a single step. This is what makes it so useful for things like data science and scientific research.

In [None]:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Adding arrays directly
result = a + b
print(result)  # [5 7 9]


2. **How does broadcasting work in NumPy**

Broadcasting is how NumPy performs element-wise math on arrays that have different shapes.

Imagine you want to add the number **5** to every single number in a list of a hundred numbers. Normally, you'd have to write a loop to do this, adding **5** to each number one by one.

Broadcasting lets you do this in one go. NumPy automatically takes that single number **5** and 'stretches' or 'broadcasts' it to match the shape of the hundred-number list. This allows the operation to happen quickly and easily without you needing to write a loop.

The core idea is that a smaller array is automatically "expanded" to fit the shape of a larger array for a specific operation.

This works for different shapes, as long as they meet a few rules:

* NumPy checks the shapes of the arrays starting from the end (the last dimension).
* If a dimension is **1**, it's "stretched" to match the other array's size.
* If the sizes differ and neither of them is **1**, NumPy raises a ValueError.

In [None]:
import numpy as np
a = np.array([1, 2, 3])
print(a + 10)   # [11 12 13]
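The three rules above also apply when both operands are arrays. The following is a minimal sketch; the array names and values are made up for illustration. A column of shape (3, 1) and a row of shape (4,) stretch into a (3, 4) grid, while shapes (3,) and (4,) cannot be combined.

```python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])        # shape (4,), treated as (1, 4)

# Compared from the last dimension: 1 vs 4 -> stretch, then 3 vs 1 -> stretch
grid = col + row
print(grid.shape)   # (3, 4)
print(grid)

# Shapes (3,) and (4,) differ and neither is 1, so broadcasting fails
try:
    np.array([1, 2, 3]) + row
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Broadcast error:", err)
```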


3. **What is a Pandas DataFrame**

A Pandas DataFrame is like an Excel spreadsheet or a SQL table inside Python. It is a 2D labeled data structure with rows and columns, where each column can have a different data type (numbers, text, dates, etc.). Pandas is built on top of NumPy and makes working with data much easier.

Key features of DataFrame:

Data is stored in rows (index) and columns (labels).

Columns can be of different types (int, float, string).

Easy to handle missing data.

Built-in functions for filtering, grouping, merging, and reshaping.

In [None]:
import pandas as pd

data = {
    "Name": ["Amit", "Riya", "John"],
    "Age": [23, 25, 22],
    "Marks": [85, 90, 88]
}

df = pd.DataFrame(data)
print(df)


In [None]:
   Name  Age  Marks
0  Amit   23     85
1  Riya   25     90
2  John   22     88


4. **Explain the use of the groupby() method in Pandas**

The groupby() method in Pandas is used when we want to split data into groups, perform operations on each group, and then combine the results. It is very useful for summarizing and analyzing data.

Think of groupby() as:
Split → Apply → Combine

Example 1: Group by one column

In [None]:
import pandas as pd

data = {
    "Department": ["IT", "IT", "HR", "HR", "Finance"],
    "Salary": [50000, 60000, 40000, 45000, 70000]
}

df = pd.DataFrame(data)

# Average salary by department
print(df.groupby("Department")["Salary"].mean())


In [None]:
Department
Finance    70000.0
HR         42500.0
IT         55000.0
Name: Salary, dtype: float64
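The "apply" step does not have to be a single function. As a small sketch on the same data, agg() can compute several summaries per group at once; the choice of count, mean, and max here is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "Finance"],
    "Salary": [50000, 60000, 40000, 45000, 70000],
})

# Split by Department, apply three aggregations, combine into one table
summary = df.groupby("Department")["Salary"].agg(["count", "mean", "max"])
print(summary)
```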


5. **Why is Seaborn preferred for statistical visualizations**

Seaborn sits on top of Matplotlib and focuses on statistics-friendly plots. It gives beautiful default styles, smart color palettes, and functions that understand DataFrames directly. This makes it easy to plot variables by column names and add statistical summaries (like confidence intervals) without extra code.

Why many people prefer it:

Better defaults: Clean themes and palettes make charts look professional with little effort.

Works with Pandas: Pass a DataFrame and column names; no manual array slicing.

Built-in stats: sns.regplot() draws a regression line with a confidence band; sns.violinplot() and sns.boxplot() summarize distributions; sns.countplot() handles categorical counts.

Faceting: sns.catplot()/sns.relplot() create small multiples by category.

Color handling: Automatic, perceptually balanced palettes for categories and continuous data.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")
sns.set_theme()

# Compare total_bill by day with a boxplot
sns.boxplot(data=tips, x="day", y="total_bill")

# Regression with confidence interval, on a new figure
# so it is not drawn over the boxplot
plt.figure()
sns.regplot(data=tips, x="total_bill", y="tip")


6. **What are the differences between NumPy arrays and Python lists**

NumPy arrays and Python lists both hold collections, but they are designed for different goals.

NumPy arrays (ndarray):

Homogeneous: All elements share one data type (e.g., float64), which saves memory.

Fast math: Vectorized operations run in optimized C code; no Python loops.

Shapes & broadcasting: Work in multiple dimensions (2D, 3D…) and support broadcasting.

Slicing returns views: Often no copy; edits can affect the original (fast, memory-efficient).

Rich APIs: Linear algebra, FFT, random numbers.

Python lists:

Heterogeneous: Can store mixed types ([1, "a", 3.2]).

Flexible: Great for general programming, but slower for heavy numeric work.

No vectorized math: You must loop or use list comprehensions.

Slicing copies: Typically independent of the original.

In [None]:
import numpy as np
a = np.array([1,2,3], dtype=float)
b = np.array([4,5,6], dtype=float)
print(a + b)       # [5. 7. 9.]
print(a * b)       # [ 4. 10. 18.]


In [None]:
x = [1,2,3]; y = [4,5,6]
z = [xi + yi for xi, yi in zip(x, y)]  # manual loop
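The "homogeneous vs. heterogeneous" point can be checked directly. This is a small sketch with made-up values: NumPy converts mixed numbers to one common dtype, while a list keeps each element's own type.

```python
import numpy as np

# Mixing ints and a float: NumPy picks one dtype that can hold every value
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                          # float64

# A list keeps each element's own Python type
items = [1, "a", 3.2]
print([type(v).__name__ for v in items])    # ['int', 'str', 'float']

# With a fixed dtype, memory use is predictable: 3 elements * 8 bytes each
ints = np.array([1, 2, 3], dtype=np.int64)
print(ints.itemsize, ints.nbytes)           # 8 24
```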


7. **What is a heatmap, and when should it be used**

A heatmap displays the values of a table or matrix as colors, with color intensity encoding magnitude. Whether larger numbers appear darker or brighter depends on the chosen colormap. Heatmaps are helpful when you want to see patterns, clusters, or correlations in a 2D grid of numbers at a glance.

Common uses:

Correlation matrix: See which variables move together.

Confusion matrix: Evaluate classification model performance (TP/FP/FN/TN).

Pivoted summaries: Show sales by Region × Product, or attendance by Day × Hour.

Time vs feature: Spot seasonal or hourly trends.

Good moments to use heatmaps:

When values are dense and exact numbers matter less than overall patterns.

When you want to compare many categories across two dimensions.

In [None]:
import seaborn as sns
import pandas as pd

df = sns.load_dataset("iris").drop(columns=["species"])
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt=".2f")


8. **What does the term “vectorized operation” mean in NumPy**

A vectorized operation means performing a mathematical operation on entire arrays at once—without explicit Python loops. Under the hood, NumPy runs fast C-level loops on contiguous memory, which makes it much faster and the code shorter.

Why it matters:

Speed: Avoids Python’s per-element overhead.

Clarity: Code reads like math.

Fewer bugs: Fewer loops, fewer mistakes.

In [None]:
import numpy as np
a = np.array([1, 2, 3, 4], dtype=float)

# Vectorized expression on all elements
y = a**2 + 2*a + 1        # (a + 1)^2
print(y)  # [ 4.  9. 16. 25.]


In [None]:
a = [1,2,3,4]
y = []
for v in a:
    y.append(v**2 + 2*v + 1)
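To see the speed difference rather than take it on faith, the two versions can be timed. This is a rough sketch using the standard timeit module; the array size and repeat count are arbitrary choices, and the actual timings depend on the machine, so no numbers are shown.

```python
import timeit
import numpy as np

n = 100_000
values = list(range(n))
arr = np.arange(n, dtype=float)

def with_loop():
    # Pure Python: one interpreter step per element
    return [v**2 + 2*v + 1 for v in values]

def vectorized():
    # NumPy: the loop runs in compiled code
    return arr**2 + 2*arr + 1

loop_time = timeit.timeit(with_loop, number=5)
vec_time = timeit.timeit(vectorized, number=5)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")

# Both approaches compute the same values
same = np.allclose(with_loop(), vectorized())
print(same)
```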


9. **How does Matplotlib differ from Plotly**

Matplotlib is the foundational Python plotting library—excellent for static, publication-quality figures and fine-grained control. Plotly is designed for interactive, web-ready graphics with built-in hover, zoom, and tooltips.

Matplotlib:

Strengths: Full control over figure elements; integrates with scientific Python; great for static reports and journals.

Usage: Imperative API (plt.plot, plt.scatter), plus object-oriented approach.

Interactivity: Limited without extras (e.g., %matplotlib widget with the ipympl package, or ipywidgets).

Plotly:

Strengths: Interactivity by default (hover labels, pan/zoom, legend toggles).

Web friendly: Exports to HTML; easy to embed in dashboards/apps.

High-level API: plotly.express creates rich charts with minimal code.

In [None]:
import matplotlib.pyplot as plt
x = [1,2,3,4]; y = [3,1,4,2]
plt.plot(x, y, marker="o")
plt.title("Matplotlib Line")
plt.show()


In [None]:
import plotly.express as px
fig = px.line(x=[1,2,3,4], y=[3,1,4,2], title="Plotly Line")
fig.show()


10. **What is the significance of hierarchical indexing in Pandas**

Hierarchical indexing (MultiIndex) allows multiple levels of labels for rows and/or columns. It’s powerful when your data has natural hierarchies (e.g., Country → State → City) or when you want to reshape data elegantly.

Benefits:

Compact representation: Store higher-dimensional data in 2D tables.

Easy subsetting: Slice by one or more levels.

Group operations: Aggregate at different hierarchy levels.

Reshaping: Works smoothly with stack(), unstack(), and pivot_table().

In [None]:
import pandas as pd

data = {
    "state": ["KA","KA","MH","MH"],
    "city":  ["BLR","MYS","MUM","PUN"],
    "sales": [100, 80, 120, 90]
}
df = pd.DataFrame(data).set_index(["state","city"])
# Select one state
print(df.loc["KA"])
# Add a column level via unstack
print(df["sales"].unstack())  # cities as columns
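The "easy subsetting" and "group operations" benefits can be sketched on the same data. The selections below are illustrative choices, not the only way to query a MultiIndex.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["KA", "KA", "MH", "MH"],
    "city":  ["BLR", "MYS", "MUM", "PUN"],
    "sales": [100, 80, 120, 90],
}).set_index(["state", "city"])

# Aggregate at the outer level of the hierarchy
state_totals = df.groupby(level="state")["sales"].sum()
print(state_totals)

# Select one (state, city) pair with a tuple
blr_sales = df.loc[("KA", "BLR"), "sales"]
print(blr_sales)

# Cross-section on the inner level only
pun = df.xs("PUN", level="city")
print(pun)
```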


11. **What is the role of Seaborn’s pairplot() function**

pairplot() shows pairwise relationships among several numeric variables in one figure. It creates a grid: scatter plots for every pair of variables and histograms/KDEs on the diagonal for each single variable. This gives a quick overview of correlations, clusters, and outliers.

Why it’s useful:

Exploratory data analysis (EDA): See how features relate before modeling.

Class separation: Add hue= to color points by a category and inspect separability.

Quick insights: Spot linear/nonlinear trends and anomalous points.

In [None]:
import seaborn as sns
sns.set_theme()

iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")


12. **What is the purpose of the describe() function in Pandas**

describe() provides summary statistics for a DataFrame or Series. It’s a one-line way to understand a dataset’s center, spread, and shape.

For numeric columns, it returns count, mean, std (standard deviation), min, 25%, 50% (median), 75%, and max.

For object/categorical columns (with include='object' or include='all'), it adds count, unique, top (most frequent), and freq (the count of the top value).

In [None]:
import pandas as pd
df = pd.DataFrame({
    "Age":[20,22,21,23,22],
    "Marks":[88,91,85,90,87],
    "City":["A","B","A","B","B"]
})
print(df.describe())
print(df.describe(include="all"))


13. **Why is handling missing data important in Pandas**

Missing data can distort analysis, break models, or produce wrong conclusions. Pandas skips NaN in most aggregations by default, so a result may be computed from fewer rows than you expect and can be biased when values are not missing at random. Many ML models also cannot handle NaN directly and will fail.

Common strategies:

* Detect: df.isna().sum() counts missing values per column.
* Drop: df.dropna() when the affected rows are few and non-critical.
* Impute: df.fillna(value) with domain-appropriate values:
    * Mean/median for numeric columns, mode for categorical columns.
    * Forward/backward fill for time series (ffill(), bfill()).
    * Interpolation for continuous series (interpolate()).

In [None]:
import pandas as pd
s = pd.Series([1.0, None, 3.0, None, 5.0])
print(s.isna().sum())       # count missing
print(s.fillna(s.mean()))   # mean imputation
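The other strategies can be compared on the same Series. This is a minimal sketch; which one is appropriate depends on the data and is not decided here.

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])

dropped = s.dropna()        # keep only the observed values
forward = s.ffill()         # carry the last observed value forward
linear = s.interpolate()    # linear interpolation between neighbors

print(dropped.tolist())     # [1.0, 3.0, 5.0]
print(forward.tolist())     # [1.0, 1.0, 3.0, 3.0, 5.0]
print(linear.tolist())      # [1.0, 2.0, 3.0, 4.0, 5.0]
```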


14. **What are the benefits of using Plotly for data visualization**

Plotly excels at interactive, shareable visualizations. With minimal code, you get hover tooltips, zoom/pan, legend toggles, and export to HTML for web or dashboards.

Benefits:

Interactivity by default: Users explore data without extra coding.

High-level API: plotly.express builds complex charts from DataFrames easily.

Wide chart types: From basic line/scatter to maps, 3D, animations, and treemaps.

Dash integration: Build full web dashboards using Dash (Plotly’s framework).

Beautiful defaults: Clean themes and sensible scales.

Export options: Static images (with orca/kaleido) or interactive HTML.

In [None]:
import plotly.express as px
fig = px.scatter(
    px.data.iris(), x="sepal_width", y="sepal_length",
    color="species", title="Iris Scatter (Interactive)"
)
fig.show()


15. **How does NumPy handle multidimensional arrays**

NumPy stores data in the ndarray, which can be n-dimensional: 1D vectors, 2D matrices, 3D tensors, and more. Each array has a shape (size per dimension) and axes (dimensions you operate along).

Key ideas:

Shape & dtype: a.shape, a.ndim, a.dtype.

Slicing: a[rows, cols] for 2D; works similarly for higher dimensions.

Axis operations: a.sum(axis=0) (down columns), axis=1 (across rows).

Broadcasting: Aligns shapes for element-wise operations.

Reshape: a.reshape(new_shape) without copying when possible.

In [None]:
import numpy as np
A = np.array([[1,2,3],
              [4,5,6]])      # shape (2,3)
print(A.sum(axis=0))          # [5 7 9] column sums
print(A[:, 1])                # second column -> [2 5]

B = np.arange(24).reshape(2,3,4)  # 3D array
print(B.shape)                # (2, 3, 4)
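The remaining key ideas (axis=1, reshape without copying, and broadcasting along an axis) can be sketched on the same small matrix. The variable names are illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])         # shape (2, 3)

row_sums = A.sum(axis=1)          # sum across each row
print(row_sums)                   # [ 6 15]

# Reshaping a contiguous array returns a view on the same memory
flat = A.reshape(6)
shares = np.shares_memory(A, flat)
print(shares)                     # True

# Subtract each column's mean: shapes (2, 3) and (3,) broadcast together
col_means = A.mean(axis=0)
centered = A - col_means
print(centered)
```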


16. **What is the role of Bokeh in data visualization**

Bokeh is a Python library for interactive, web-based visualizations. It renders to modern browsers (HTML/JS) and supports server apps for streaming and real-time dashboards.

Strengths:

Interactivity: Hover tools, pan/zoom, selections, linked brushing.

Server & streaming: Update plots from live data sources.

Fine glyph control: Low-level primitives (circles, lines) plus high-level charts.

Embeddable: Export to standalone HTML or integrate in web apps.

Large datasets: Datashader integration for millions of points.

In [None]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()

p = figure(title="Bokeh Scatter")
p.scatter([1,2,3,4], [6,7,2,4], size=10)  # circle() with size= is deprecated in recent Bokeh releases
show(p)


17. **Explain the difference between apply() and map() in Pandas**

Series.map(func or dict): Works on a Series, applying a function or mapping values via a dict. It transforms each element independently.

DataFrame.apply(func, axis=0/1): Works on a DataFrame, applying a function to each column (axis=0) or each row (axis=1). The function receives a Series.

In [None]:
import pandas as pd
s = pd.Series(["a","b","c"])
print(s.map({"a":1,"b":2}))     # c becomes NaN (no mapping)

df = pd.DataFrame({"A":[1,2,3],"B":[10,20,30]})
# Column-wise sum (axis=0 is default)
print(df.apply(sum))
# Row-wise custom logic
print(df.apply(lambda row: row["B"] - row["A"], axis=1))
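map() also accepts a function instead of a dict, and apply() can run any function that takes a Series. A short sketch with made-up data and thresholds:

```python
import pandas as pd

grades = pd.Series([72, 88, 95])
# map with a function: called once per element
labels = grades.map(lambda g: "high" if g >= 85 else "low")
print(labels.tolist())            # ['low', 'high', 'high']

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
# apply with axis=0 (default): the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min())
# apply with axis=1: the function receives each row as a Series
row_total = df.apply(lambda row: row.sum(), axis=1)
print(col_range.tolist())         # [2, 20]
print(row_total.tolist())         # [11, 22, 33]
```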


18. **What are some advanced features of NumPy**

NumPy has many powerful features beyond basic arrays:

Universal functions (ufuncs): Fast element-wise ops (np.add, np.sin) with broadcasting.

einsum: Compact tensor algebra for speed and clarity.

Views vs copies: Slicing returns views when possible (memory-efficient).

Structured/record arrays: Columns with named fields and different dtypes.

Masked arrays: Handle invalid/missing numeric values in calculations.

Vectorization & broadcasting: Express complex math without loops.

Random Generator: np.random.default_rng() with modern, reproducible algorithms.

Linear algebra & FFT: np.linalg and np.fft for advanced math.

Memory control: np.memmap for out-of-core arrays on disk.

In [None]:
import numpy as np

# einsum for matrix multiply (A @ B)
A = np.arange(6).reshape(2,3)
B = np.arange(6).reshape(3,2)
C = np.einsum("ik,kj->ij", A, B)

# Structured array
dt = np.dtype([("name","U10"),("age","i4")])
rec = np.array([("Amit",23),("Riya",25)], dtype=dt)
print(rec["age"])  # [23 25]
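Two more items from the list, masked arrays and the random Generator, can be sketched briefly. The sentinel value -999.0 and the seed 42 are assumptions chosen for illustration.

```python
import numpy as np

# Masked array: treat -999.0 as "missing" and exclude it from calculations
readings = np.array([20.5, -999.0, 21.0, 22.5])
masked = np.ma.masked_values(readings, -999.0)
mean_valid = masked.mean()        # mean of the three valid readings
print(masked.count(), mean_valid)

# Random Generator: the same seed reproduces the same draws
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)
same_draws = np.array_equal(rng1.random(3), rng2.random(3))
print(same_draws)                 # True
```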


19. **How does Pandas simplify time series analysis**

Pandas treats dates as first-class citizens. With a DatetimeIndex, you can resample, window, shift, and time-zone-convert easily.

Key features:

Parsing dates: pd.to_datetime() converts strings to timestamps.

Indexing by time: Slice by date ranges using partial date strings (df.loc["2024-01":"2024-03"]).

Resampling: Change frequency (e.g., daily → monthly) with resample().

Rolling windows: rolling().mean() for moving averages.

Shifts & lags: shift() for supervised learning features.

Time zones: tz_localize() and tz_convert().

In [None]:
import pandas as pd
dates = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([10,12,9,15,11,13], index=dates)

monthly = s.resample("ME").mean()     # monthly mean ("ME" = month end; pandas older than 2.2 uses "M")
rolling3 = s.rolling(3).mean()        # 3-day moving average
lag1 = s.shift(1)                      # previous day's value
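Parsing, time-based slicing, and time zones from the list above can be sketched as follows. The dates, values, and the Asia/Kolkata zone are illustrative assumptions.

```python
import pandas as pd

raw = pd.DataFrame({
    "date": ["2024-01-30", "2024-01-31", "2024-02-01", "2024-02-02"],
    "value": [10, 12, 9, 15],
})

# Parse strings into timestamps and use them as the index
raw["date"] = pd.to_datetime(raw["date"])
ts = raw.set_index("date")["value"]

# Partial-string indexing: everything in January, then from Feb 1 onward
january = ts.loc["2024-01"]
feb_on = ts.loc["2024-02-01":]
print(january.tolist(), feb_on.tolist())

# Attach a time zone, then convert to another one
utc = ts.tz_localize("UTC")
kolkata = utc.tz_convert("Asia/Kolkata")
print(kolkata.index[0])
```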


20. **What is the role of a pivot table in Pandas**

A pivot table summarizes data like Excel’s PivotTable. It aggregates values across rows (index) and columns, producing a matrix of computed statistics (sum, mean, count, etc.).

Typical use: Summarize Sales by Region × Product, Average score by Class × Subject, etc.

In [None]:
import pandas as pd
df = pd.DataFrame({
    "Region":["East","East","West","West","East"],
    "Product":["A","B","A","B","A"],
    "Sales":[100,150,120,130,90]
})
pt = pd.pivot_table(
    df, values="Sales",
    index="Region", columns="Product",
    aggfunc="sum", fill_value=0
)
print(pt)
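As in a spreadsheet pivot table, row and column totals can be added. A small sketch on the same data using the margins option; the label "Total" is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "East"],
    "Product": ["A", "B", "A", "B", "A"],
    "Sales": [100, 150, 120, 130, 90],
})

# margins=True appends a totals row and a totals column
pt = pd.pivot_table(
    df, values="Sales",
    index="Region", columns="Product",
    aggfunc="sum", fill_value=0,
    margins=True, margins_name="Total",
)
print(pt)
```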


21. **Why is NumPy’s array slicing faster than Python’s list slicing**

NumPy arrays store data in contiguous, typed memory. Slicing an array usually returns a view (no data copied), just a new strided window into the same memory. Operations on slices run in optimized C loops—no Python per-element overhead.

Python lists store pointers to objects scattered in memory. Slicing creates a new list and copies the references, and arithmetic requires Python loops or comprehensions, which is much slower for large numeric work.

In [None]:
import numpy as np
a = np.array([1,2,3,4,5])
s = a[1:4]      # view: [2,3,4]
s[0] = 99
print(a)        # [ 1 99  3  4  5]  original changed
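For contrast, here is a sketch of the list behavior described above, plus an explicit array copy for cases where a view is not wanted. np.shares_memory is used only to confirm which objects share data.

```python
import numpy as np

# List slicing builds a new list, so the original is untouched
nums = [1, 2, 3, 4, 5]
part = nums[1:4]
part[0] = 99
print(nums)                 # [1, 2, 3, 4, 5]

arr = np.array([1, 2, 3, 4, 5])
view = arr[1:4]                      # view: shares memory with arr
independent = arr[1:4].copy()        # explicit copy: separate memory
independent[0] = 99                  # does not affect arr

shares_view = np.shares_memory(arr, view)
shares_copy = np.shares_memory(arr, independent)
print(shares_view, shares_copy)      # True False
```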


22. **What are some common use cases for Seaborn?**

Seaborn shines in statistical data exploration and tidy DataFrame workflows.

Popular use cases:

Distribution analysis: histplot, kdeplot, displot to study shapes, skew, and outliers.

Categorical comparisons: boxplot, violinplot, stripplot, barplot for groups (e.g., sales by region).

Relationships: scatterplot, lineplot, regplot (with regression fit & CI).

Correlation & matrices: heatmap for correlation or confusion matrices.

Pairwise EDA: pairplot to scan trends across many variables.

Faceting: relplot, catplot to split visuals by category (rows/cols/hue), revealing subgroup patterns.

Time series: lineplot with estimator=None for raw trajectories or with aggregation by category.

In [None]:
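import matplotlib.pyplot as plt
import seaborn as sns
tips = sns.load_dataset("tips")
sns.barplot(data=tips, x="day", y="total_bill", estimator=sum)  # total bill per day
plt.figure()  # new figure so the heatmap is not drawn over the bar plot
sns.heatmap(tips.corr(numeric_only=True), annot=True)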


**Practical**

Q1. **How do you create a 2D NumPy array and calculate the sum of each row**

In [None]:
import numpy as np

# Create a 2D NumPy array
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

print("2D Array:\n", arr)

# Sum of each row (axis=1 means row-wise)
row_sum = arr.sum(axis=1)
print("Sum of each row:", row_sum)


In [None]:
#OUTPUT
2D Array:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
Sum of each row: [ 6 15 24]


Q2. **Write a Pandas script to find the mean of a specific column in a DataFrame**

In [None]:
import pandas as pd

# Create a DataFrame
data = {
    "Name": ["Amit", "Riya", "John", "Sara"],
    "Marks": [85, 90, 88, 92]
}
df = pd.DataFrame(data)

# Mean of "Marks" column
mean_marks = df["Marks"].mean()
print("Mean of Marks:", mean_marks)


In [None]:
#OUTPUT
Mean of Marks: 88.75


Q3. **Create a scatter plot using Matplotlib**

In [None]:
import matplotlib.pyplot as plt

x = [5, 7, 8, 7, 6, 9, 5]
y = [99, 86, 87, 88, 100, 86, 103]

plt.scatter(x, y, color="blue", marker="o")
plt.title("Scatter Plot Example")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()


Q4. **How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap**

In [None]:
import seaborn as sns
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    "Maths": [88, 92, 80, 89, 100],
    "Science": [84, 94, 70, 96, 88],
    "English": [78, 90, 82, 85, 95]
})

# Calculate correlation matrix (computed by Pandas; Seaborn only draws it)
corr = df.corr()

# Heatmap visualization
sns.heatmap(corr, annot=True, cmap="coolwarm")


Q5. **Generate a bar plot using Plotly**

In [None]:
import plotly.express as px

data = {"Fruits": ["Apple", "Banana", "Orange", "Grapes"],
        "Sales": [100, 150, 120, 90]}

fig = px.bar(data, x="Fruits", y="Sales", title="Fruit Sales")
fig.show()


Q6. **Create a DataFrame and add a new column based on an existing column**

In [None]:
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Riya", "John"],
    "Marks": [80, 90, 85]
})

# Add new column "Result" based on Marks
df["Result"] = df["Marks"].apply(lambda x: "Pass" if x >= 85 else "Fail")
print(df)


In [None]:
#OUTPUT
   Name  Marks Result
0  Amit     80   Fail
1  Riya     90   Pass
2  John     85   Pass
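On larger DataFrames, the same column can be built without calling a Python function for every row. The sketch below swaps apply() for np.where, which is an alternative technique rather than the method used above, and produces the same Pass/Fail labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Riya", "John"],
    "Marks": [80, 90, 85],
})

# Vectorized condition: "Pass" where Marks >= 85, otherwise "Fail"
df["Result"] = np.where(df["Marks"] >= 85, "Pass", "Fail")
print(df)
```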


Q7. **Write a program to perform element-wise multiplication of two NumPy arrays**

In [None]:
import numpy as np

a = np.array([2, 4, 6])
b = np.array([1, 3, 5])

result = a * b
print("Element-wise multiplication:", result)


In [None]:
#OUTPUT
Element-wise multiplication: [ 2 12 30]


Q8. **Create a line plot with multiple lines using Matplotlib**

In [None]:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y1 = [2, 4, 6, 8, 10]
y2 = [1, 3, 5, 7, 9]

plt.plot(x, y1, label="Line 1", marker="o")
plt.plot(x, y2, label="Line 2", marker="s")

plt.title("Multiple Line Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.show()


Q9. **Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold**

In [None]:
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Riya", "John", "Sara"],
    "Marks": [70, 85, 92, 60]
})

# Filter rows where Marks > 75
filtered = df[df["Marks"] > 75]
print(filtered)


In [None]:
#OUTPUT
   Name  Marks
1  Riya     85
2  John     92


Q10. **Create a histogram using Seaborn to visualize a distribution**

In [None]:
import seaborn as sns
import numpy as np

data = np.random.randn(100)  # 100 random numbers

sns.histplot(data, bins=10, kde=True, color="blue")


Q11. **Perform matrix multiplication using NumPy**

In [None]:
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

result = np.dot(A, B)  # or A @ B
print("Matrix Multiplication:\n", result)


In [None]:
#OUTPUT
Matrix Multiplication:
 [[19 22]
 [43 50]]
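Matrix multiplication is easy to confuse with element-wise multiplication. A short sketch on the same two matrices shows that the @ operator gives the matrix product, while * multiplies matching positions.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

matmul = A @ B          # matrix product, same as np.dot(A, B) for 2D arrays
elementwise = A * B     # multiplies matching positions only

print(matmul)           # [[19 22], [43 50]]
print(elementwise)      # [[ 5 12], [21 32]]
```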


Q12. **Use Pandas to load a CSV file and display its first 5 rows**

In [None]:
import pandas as pd

# Load CSV (replace 'data.csv' with your file path)
df = pd.read_csv("data.csv")

print(df.head())  # first 5 rows
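The script above needs a real data.csv on disk. To try read_csv without one, the CSV text can be supplied in memory through io.StringIO. The file contents below are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a file on disk
csv_text = """Name,Marks
Amit,85
Riya,90
John,88
Sara,92
Neha,79
Ravi,81
"""

df = pd.read_csv(io.StringIO(csv_text))
first_five = df.head()      # head() returns the first 5 rows by default
print(first_five)
```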


Q13. **Create a 3D scatter plot using Plotly**

In [None]:
import plotly.express as px
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 11, 12, 13, 14],
    "z": [5, 6, 7, 8, 9],
    "category": ["A", "B", "A", "B", "A"]
})

fig = px.scatter_3d(df, x="x", y="y", z="z", color="category", size="z")
fig.show()
