# **Pandas Library**

Pandas is a powerful Python library used for data manipulation and analysis. It provides easy-to-use tools for working with structured data (like tables) efficiently.

### **Why Use Pandas?**

**1. Easy Data Handling**

Pandas lets you work with spreadsheet-like data (rows & columns) in Python.

You can load data from CSV, Excel, SQL databases, JSON, and more.

**2. Data Cleaning Made Simple**

Fix missing or incorrect data easily.

Remove duplicates, filter rows, and modify columns effortlessly.

**3. Powerful Data Analysis**

Calculate statistics (mean, median, max, min, etc.) in one line.

Group, sort, and filter data quickly.

**4. Works Well with Other Libraries**

Pandas integrates with Matplotlib/Seaborn (for plotting), Scikit-learn (for machine learning), and more.

**5. Fast & Efficient**

Built on NumPy, so vectorized operations on large datasets are typically much faster than equivalent pure-Python loops.


## **Data Structure in Pandas**

Pandas has two main data structures: Series (1D) and DataFrame (2D). They are the building blocks for data manipulation in Python.

**1. Series - 1D Data (Like a Single Column)**
A Series is like a single column in Excel or a list with labels (index).

**Key Features:**

- Holds one type of data (int, float, string, etc.)

- Has an index (row labels)

- Behaves like a NumPy array but with labels

```
import pandas as pd  # import pandas under the conventional alias pd

# Create a Series from a list (default index: 0, 1, 2, ...)
a = pd.Series([10, 20, 30, 40, 50])
print(a)
```

```
0    10
1    20
2    30
3    40
4    50
dtype: int64
```


```
# Create a Series with a custom index (one label per item)
a = pd.Series([10, 20, 30, 40, 50], index=["A", "B", "C", "D", "E"])
print(a)
```

```
A    10
B    20
C    30
D    40
E    50
dtype: int64
```


### **2. DataFrames in Pandas - 2D Data**

A DataFrame is a table with rows and columns (like an Excel sheet or SQL table).

**Key Features:**

- Made up of multiple Series (columns)

- Each column can have a different data type

- Has row index and column names

```
# Create a dictionary named 'data' with three keys: "Name", "Age", and "City".
# Each key maps to a list of values.
data = {
    "Name": ["ABC", "DEF", "XYZ"],       # list of names
    "Age": [25, 30, 35],                 # corresponding age for each person
    "City": ["Pune", "Mumbai", "Delhi"]  # corresponding city for each person
}

df = pd.DataFrame(data)  # create a DataFrame 'df' from the dictionary 'data'
print(df)                # print the DataFrame
```

```
  Name  Age    City
0  ABC   25    Pune
1  DEF   30  Mumbai
2  XYZ   35   Delhi
```


```
# Custom row index
df = pd.DataFrame(data, index=["P1", "P2", "P3"])
print(df)
```

```
   Name  Age    City
P1  ABC   25    Pune
P2  DEF   30  Mumbai
P3  XYZ   35   Delhi
```


## **Data Types in Pandas**

In Pandas, data types (also called dtypes) define what kind of data is stored in each column of a DataFrame or Series. Choosing the right data type helps save memory and improves performance.

**Common Pandas Data Types**

| Pandas dtype     | Python Equivalent | Description                                 |
|------------------|-------------------|---------------------------------------------|
| `int64`          | `int`             | Integer numbers (e.g., 5, -3, 100)          |
| `float64`        | `float`           | Decimal numbers (e.g., 3.14, -0.5)          |
| `object`         | `str`             | Text (strings) or mixed data types          |
| `bool`           | `bool`            | True or False                               |
| `datetime64[ns]` | `datetime`        | Date and time (e.g., "2023-10-05")          |
| `category`       | -                 | Limited unique values (e.g., "Male", "Female") |
| `timedelta64[ns]`| `timedelta`       | Time differences (e.g., "5 days")           |


### **Why Are Data Types Important?**

- **Memory Efficiency**

Using int32 instead of int64 saves memory for large datasets.

category dtype reduces memory for repetitive text (e.g., gender, country).

- **Performance**

Numerical operations (int, float) are faster than strings (object).

- **Correct Operations**

You can’t do math on strings (object dtype).

Dates (datetime64) allow time-based calculations.
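
As a rough illustration of the memory point above, the sketch below stores the same data under different dtypes and compares the sizes. The column contents and the row counts are made up for the example; exact byte counts will vary by pandas version and platform.

```python
import pandas as pd

# Hypothetical data: a text column with only two distinct values, repeated many times
gender = pd.Series(["Male", "Female"] * 50_000, dtype="object")

# deep=True also counts the memory used by the Python string objects
object_bytes = gender.memory_usage(deep=True)
category_bytes = gender.astype("category").memory_usage(deep=True)

# Hypothetical integer column stored as int64 and as int32
numbers = pd.Series(range(100_000), dtype="int64")
int64_bytes = numbers.memory_usage(index=False)
int32_bytes = numbers.astype("int32").memory_usage(index=False)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
print(f"int64:    {int64_bytes:,} bytes")
print(f"int32:    {int32_bytes:,} bytes")
```

Downcasting to `int32` is only safe when every value fits in the smaller type.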



```
df.dtypes  # returns the data type of each column
```

```
Name    object
Age      int64
City    object
dtype: object
```


```
import pandas as pd

data = {
    "Name": ["ABC", "DEF", "XYZ"],            # object (string)
    "Age": [25, 30, 35],                      # int64
    "Salary": [50000.0, 60000.5, 70000.0],    # float64
    "Employed": [True, False, True],          # bool
    "Gender": ["Male", "Female", "Male"],     # object (string)
    "Join_Date": pd.to_datetime(["2020-01-01", "2019-05-15", "2021-11-20"])  # datetime64[ns]
}

df = pd.DataFrame(data)
print(df.dtypes)
```

```
Name                 object
Age                   int64
Salary              float64
Employed               bool
Gender               object
Join_Date    datetime64[ns]
dtype: object
```


**To change the data type of a column, use astype(), which converts the column to the specified dtype.**



```
df["Age"] = df["Age"].astype("float64")        # convert Age to float
df["Employed"] = df["Employed"].astype("int")  # convert bool to int (1 or 0)
```

```
df["Join_Date"] = pd.to_datetime(df["Join_Date"])  # convert to datetime
```

```
df["Gender"] = df["Gender"].astype("category")  # convert to a categorical column
```

```
print(df)  # the DataFrame after converting the data types
```

```
  Name   Age   Salary  Employed  Gender  Join_Date
0  ABC  25.0  50000.0         1    Male 2020-01-01
1  DEF  30.0  60000.5         0  Female 2019-05-15
2  XYZ  35.0  70000.0         1    Male 2021-11-20
```


- **Pandas automatically assigns data types when loading data.**

- **Use df.dtypes to check column types.**

- **Convert types with astype() for better performance.**

- **Use category for repetitive text, datetime64 for dates, and numerical types (int, float) for calculations.**

# **Reading Data in Pandas**

Pandas provides functions for reading data from a wide range of sources.

**1. Reading from CSV files**
```
pd.read_csv('data.csv')  # Basic CSV read
pd.read_csv('data.csv', sep=';')  # Custom delimiter
pd.read_csv('data.csv', header=None)  # No header row
pd.read_csv('data.csv', names=['col1','col2'])  # Custom column names
pd.read_csv('data.csv', index_col='date')  # Set index column
pd.read_csv('data.csv', skiprows=5)   # Skip first 5 rows
pd.read_csv('data.csv', na_values=['NA','missing'])  # Custom NA values

```
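
To see a few of these options working together without needing a file on disk, the sketch below reads a small, made-up CSV string through `io.StringIO`. The column names and values are invented for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_text = """date;city;temp
2024-01-01;Pune;21.5
2024-01-02;Mumbai;missing
2024-01-03;Delhi;NA
"""

df_csv = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                      # semicolon-delimited input
    index_col="date",             # use the date column as the row index
    na_values=["NA", "missing"],  # treat these strings as NaN
)

print(df_csv)
print(df_csv["temp"].isna().sum())  # number of missing temperatures
```

Because both placeholder strings are converted to NaN, the `temp` column is parsed as a float column rather than as text.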

**2. Reading from Excel files**

```
pd.read_excel('data.xlsx')  # Read first sheet
pd.read_excel('data.xlsx', sheet_name='Sheet2')  # Specific sheet
pd.read_excel('data.xlsx', sheet_name=[0,1])  # Multiple sheets (returns dict)
pd.read_excel('data.xlsx', usecols='A:C')  # Read specific columns

```

**3. Reading from Text files**

```
pd.read_table('data.txt')  # Tab-delimited (like CSV)
pd.read_fwf('data.txt')  # Fixed-width format
```

**4. Reading Data from SQL Database**

```
import sqlite3
import pandas as pd
from sqlalchemy import create_engine  # recommended for production use

# Built-in SQLite connection
conn = sqlite3.connect('database.db')

# SQLAlchemy engine; replace user, pass, and db with your own credentials
engine = create_engine('postgresql://user:pass@localhost:5432/db')

pd.read_sql('SELECT * FROM table_name', con=conn)  # run a query
pd.read_sql_table('table_name', con=engine)        # read a whole table (requires SQLAlchemy)
pd.read_sql_query('SELECT col1, col2 FROM table_name', con=conn)
```

**5. Reading Data from HTML tables**

```
pd.read_html('https://example.com/tables.html')  # Returns list of DataFrames
```

**6. Reading from JSON Data**

```
pd.read_json('data.json')
pd.json_normalize(nested_json)  # For nested JSON
```
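
As a small illustration of what `json_normalize()` does with nested data, the sketch below flattens two made-up records. The `nested_json` contents are hypothetical; in practice they would come from `json.load()` or an API response.

```python
import pandas as pd

# Hypothetical nested records
nested_json = [
    {"id": 1, "user": {"name": "ABC", "city": "Pune"}},
    {"id": 2, "user": {"name": "DEF", "city": "Mumbai"}},
]

# Nested keys are flattened into dotted column names such as "user.name"
flat = pd.json_normalize(nested_json)
print(flat)
```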

**7. Reading from APIs**

```
import requests
data = requests.get('https://api.example.com/data').json()
df = pd.DataFrame(data)
```

**8. Reading data from Clipboard**

```
df = pd.read_clipboard()  # Great for quick testing
```

## **Data Inspection Methods in Pandas**

1. **head()**
```
df.head()
```
- Returns the first 5 rows of the DataFrame by default. Useful for a quick look at the data.

2. **tail()**
```
df.tail()
```
- Returns the last 5 rows of the DataFrame by default. Useful for inspecting the end of the dataset.

3. **shape**
```
df.shape
```
Returns a tuple (rows, columns) indicating the number of rows and columns in the DataFrame.

4. **info()**
```
df.info()
```

Displays a concise summary of the DataFrame, including:

- Number of non-null entries

- Data types of each column

- Memory usage

5. **describe()**
```
df.describe()
```
Provides a statistical summary of numerical columns, including:

- Count

- Mean

- Standard deviation

- Min/Max

- 25th, 50th (median), and 75th percentiles

6. **dtypes**
```
df.dtypes
```

- Shows the data type of each column in the DataFrame (e.g., int64, object, float64).

7. **columns**
```
df.columns
```

- Returns an Index object containing the names of all columns in the DataFrame.

8. **index**
```
df.index
```
- Returns the index (row labels) of the DataFrame.

9. **isnull()**
```
df.isnull()
```
- Returns a DataFrame of the same shape as df, with True for missing values (NaN) and False otherwise.

10. **isnull().sum()**
```
df.isnull().sum()
```
- Returns total number of missing values in each column

11. **duplicated()**
```
df.duplicated()
```
- Returns a boolean Series indicating whether each row is a duplicate of a previous row.

12. **duplicated().sum()**
```
df.duplicated().sum()
```
- Returns the total number of duplicate rows.

13. **nunique()**
```
df.nunique()
```

- Returns the number of unique values per column.

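14. **value_counts()**
```
df['column_name'].value_counts()
```
- Returns the count of each unique value in a Series, so it is most useful on a single column. Calling df.value_counts() on a whole DataFrame counts unique rows instead.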

15. **columns.tolist()**
```
df.columns.tolist()
```
- Converts the Index of column names to a Python list.

16. **index.tolist()**	  
```
df.index.tolist()
```
- Converts the Index of row labels to a list.

17. **empty**
```
df.empty
```
- Returns True if the DataFrame has no elements, i.e. if it has no rows or no columns. A DataFrame that contains only NaN values is not considered empty.

18. **sample()**
```
df.sample(n)
```
- Randomly selects n rows from the DataFrame. Good for spot-checking data.
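
A quick way to get a feel for these inspection methods is to run them on a tiny DataFrame where the answers can be checked by eye. The data below is made up for the example and deliberately contains one missing value and one duplicated row.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one fully duplicated row
people = pd.DataFrame({
    "Name": ["ABC", "DEF", "XYZ", "DEF"],
    "Age": [25, 30, np.nan, 30],
    "City": ["Pune", "Mumbai", "Delhi", "Mumbai"],
})

print(people.shape)                   # (rows, columns)
print(people.isnull().sum())          # missing values per column
print(people.duplicated().sum())      # number of rows that repeat an earlier row
print(people.nunique())               # distinct non-null values per column
print(people["City"].value_counts())  # frequency of each city
```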


# **Data Selection Methods**

Data selection is a critical part of data analysis and manipulation, and pandas makes it easy.

19. **Fetch one column**
```
df['column name']
```
- Returns the specified column as a Series.

20. **Fetch Multiple Columns**
```
df[['column1', 'column2']]
```
- Returns the specified columns as a DataFrame. Note the double brackets: the inner pair is a Python list of column names.

30. **iloc**
```
df.iloc[row_index]
df.iloc[row_index, column_index]
```

- Returns rows (and optionally columns) by integer position, not by label.

31. **loc**

```
df.loc['row_label']
df.loc['row_label', 'column_name']
```
- Returns the rows/columns as per the specified labels.

32. **at**

```
df.at[row_label, 'column_name']
```
- Returns a single specified cell based on the specified row label, and column name.


33. **iat**

```
df.iat[row_index, column_index]
```

- Returns a single specified cell based on specified row and column index
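
The label-based and position-based accessors can be compared side by side. The sketch below reuses the small Name/Age/City table from earlier with the custom index `P1`..`P3`; the variable names are chosen only for this example.

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["ABC", "DEF", "XYZ"], "Age": [25, 30, 35], "City": ["Pune", "Mumbai", "Delhi"]},
    index=["P1", "P2", "P3"],
)

row_by_label = people.loc["P2"]         # label-based: row "P2" as a Series
row_by_position = people.iloc[1]        # position-based: the second row
cell_by_label = people.at["P2", "Age"]  # single cell by row label and column name
cell_by_position = people.iat[1, 1]     # single cell by row and column position

# loc also accepts lists of labels to select several rows and columns at once
subset = people.loc[["P1", "P3"], ["Name", "City"]]
print(subset)
```

`at` and `iat` only ever return one cell, which makes them a good fit when a single scalar is needed.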


34. **get()**

```
df.get(key, default=None)
key: The column name you want.

default: Value to return if the column is not found. (Default is None)
```
- It's similar to **df['column_name']**, but it won’t give an error if the column doesn’t exist — it just returns None (or whatever default you provide).

35. **filter()**
```
df.filter(items=None, like=None, regex=None, axis=None)
items:	List of names to keep
like:	Substring to match in labels (e.g. all columns with 'age' in the name)
regex:	Regular expression pattern to match labels
axis:	Axis to filter on: 0 = rows, 1 = columns (default is 1)
```
- It is used to select specific rows or columns from a DataFrame by names or patterns.

36. **xs()**


```
df.xs(key, axis=0, level=None, drop_level=True)

key: The label you want to select
axis: 0 = rows (default), 1 = columns
level: The name or number of the level you want to select from
drop_level: Whether to drop the selected level from the result (default is True)

```

- It is used to select data at a particular level from a MultiIndex DataFrame (DataFrames with more than one index level).

37. **query()**

```
df.query('expression')   # e.g. df.query('Age > 30')

```

- It lets you filter rows in a DataFrame using a string expression, similar to a SQL WHERE clause. Inside query() you can use numeric filters, string filters, multiple conditions (and, or), and local Python variables (prefixed with @).
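
A minimal sketch of the kinds of expressions `query()` accepts, using the small Name/Age/City table from earlier. The variable `min_age` is a hypothetical local variable introduced only to show the `@` syntax.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["ABC", "DEF", "XYZ"],
    "Age": [25, 30, 35],
    "City": ["Pune", "Mumbai", "Delhi"],
})

min_age = 28  # local Python variable, referenced with @ inside the expression

older = people.query("Age > 28")                                 # numeric filter
in_pune = people.query("City == 'Pune'")                         # string filter
combined = people.query("Age >= @min_age and City != 'Delhi'")   # multiple conditions + variable

print(combined)
```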




#  **Data Cleaning Methods**

Data cleaning means fixing or removing incorrect, missing, or unwanted data in your DataFrame.
It helps make your data accurate and ready for analysis.

38. **dropna()**
```
df.dropna()
```
- Drops rows with missing values.

39. **dropna() - for columns**
```
df.dropna(axis=1)
```
- Drops columns with missing values.

40. **fillna()**

```
df.fillna(0)
```
- Fills the missing values with 0

41. **ffill() - forward fill**
```
df.ffill(inplace=True)
```
- Fills missing values with the value from the previous row. Older code writes this as df.fillna(method='ffill'); the method argument of fillna() is deprecated since pandas 2.1 in favor of ffill() and bfill().

42. **bfill() - backward fill**
```
df.bfill(inplace=True)
```
- Fills missing values with the value from the next row.

43. **fillna() - using mean**
```
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
```
- Fills missing values with the column mean

44. **fillna() - using median**
```
df['column_name'] = df['column_name'].fillna(df['column_name'].median())
```
- Fills missing values with the column median

45. **fillna() - using mode**
```
df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])
```
- Fills missing values with the column mode (the most frequent value).
- mode() returns a Series because several values can tie. Use [0] to take the first of the most frequent values.

46. **fillna() - different values for different columns**
```
df.fillna({
    'column1': value1,
    'column2': value2,
    # ...
}, inplace=True)
```
- Fills missing values with a different value for each column. The dictionary maps column names to fill values (not to fill methods such as ffill).
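
The fill strategies above behave quite differently on the same gaps. The sketch below applies three of them to a small, made-up DataFrame so the results can be compared; the column names and values are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps in both columns
readings = pd.DataFrame({
    "temp": [20.0, np.nan, 24.0, np.nan],
    "city": ["Pune", None, "Delhi", "Pune"],
})

forward = readings["temp"].ffill()                             # carry the previous value forward
with_mean = readings["temp"].fillna(readings["temp"].mean())   # mean of 20.0 and 24.0 is 22.0
per_column = readings.fillna({"temp": 0.0, "city": "Unknown"}) # one fill value per column

print(forward.tolist())
print(with_mean.tolist())
print(per_column)
```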

47. **replace()**
```
df['column'] = df['column'].replace({old_value: new_value, ...})
```
- Replaces values in the selected column according to the provided dictionary. The keys of the dictionary are the values to be replaced (old_value), and the dictionary values are the replacements (new_value). Assigning the result back is more reliable than calling replace(..., inplace=True) on a selected column, which may not update the original DataFrame.

48. **rename()**
```
df.rename(columns={'old_name': 'new_name'}, inplace=True)
```
- Renames the column name with the specified new name.

49. **reset_index()**
```
df.reset_index(level=None, drop=False, inplace=False)
level: Resets a specific level if MultiIndex. (Optional)
drop: True = remove index (don’t add it as a column)
False = add index as a column
inplace: True = modify the original DataFrame
False = return a new DataFrame
```

- It is used to reset the index of a DataFrame back to the default integer index

50. **str.strip()**

```
df['column'] = df['column'].str.strip()
```

- Removes Leading/Trailing Spaces in Strings.

51. **notnull()**
```
df = df[df['column'].notnull()]
```
- Keeps only the rows where the column is not null, i.e. drops rows with a missing value in that column.

52. **str.lower()/str.upper()**
```
df['column'] = df['column'].str.lower()
df['column'] = df['column'].str.upper()
```
- Converts text in a column to lowercase or uppercase

**Handling Outliers in Pandas**

53. **Simple filtering using threshold**
```
df = df[df['column'] < upper_limit]
df = df[df['column'] > lower_limit]
```
- Remove values beyond a fixed limit.

54.  **Using IQR - Interquartile Range Method**
```
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df = df[(df['column'] >= lower) & (df['column'] <= upper)]
```
- Calculate Q1 (25th percentile) and Q3 (75th percentile)

- Compute IQR = Q3 - Q1

- Define outlier range:

  - Lower bound = Q1 - 1.5 × IQR

  - Upper bound = Q3 + 1.5 × IQR

- Remove rows outside this range.



# **Data Transformation Methods**

55. **map()**
```
df['column'].map(function_or_dict)
```
- It is used to map or transform values in a single column or Series.

56. **apply()**
```
df['column'].apply(func) # On Series
# On DataFrame
df.apply(func, axis=0)   # column-wise
df.apply(func, axis=1)   # row-wise
```

- It is used to apply a function row-wise or column-wise in a DataFrame, or to each value of a single Series.

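57. **applymap()**
```
df.applymap(function)
```
- It is used to apply a function to each element of a DataFrame (not Series). Since pandas 2.1, applymap() is deprecated and the same operation is available as df.map(function).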
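
The difference between `map()` and `apply()` is easiest to see on a small example. The sketch below uses a made-up `staff` DataFrame; the column names and the lambda functions are assumptions chosen only for illustration.

```python
import pandas as pd

staff = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Salary": [50000.0, 60000.5, 70000.0],
    "Bonus": [1000.0, 2000.0, 1500.0],
})

# map with a dict: each value is looked up; values missing from the dict become NaN
staff["Gender_code"] = staff["Gender"].map({"Male": "M", "Female": "F"})

# apply on a Series: the function is called once per value
staff["Salary_k"] = staff["Salary"].apply(lambda s: round(s / 1000, 1))

# apply on a DataFrame with axis=1: the function is called once per row
staff["Total"] = staff.apply(lambda row: row["Salary"] + row["Bonus"], axis=1)

# apply on a DataFrame with axis=0 (default): the function is called once per column
column_max = staff[["Salary", "Bonus"]].apply(lambda col: col.max())

print(staff)
print(column_max)
```

For simple arithmetic such as `Total`, the vectorized form `staff["Salary"] + staff["Bonus"]` gives the same result and is usually faster than a row-wise `apply()`.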

58. **sort_values()**
```
df.sort_values(by='column_name', ascending=True, inplace=False)
by: Column name(s) to sort by
ascending: True = ascending order (default), False = descending
inplace: True to modify the original DataFrame
```

- It sorts the DataFrame by column values

59. **sort_index()**
```
df.sort_index(axis=0, ascending=True)
axis=0 → sort by row index
axis=1 → sort by column names
ascending: default is True
```
- Sorts the DataFrame by row or column index

60. **groupby()**
```
df.groupby('col').sum()
df.groupby(['col1', 'col2'])['val'].agg(['mean', 'max'])
```
- The groupby() method takes one or more column names as arguments, which are used to group the DataFrame.
- After grouping, you can apply various functions to each group.
- Common aggregation functions include sum(), mean(), median(), count(), min(), max(), and std().
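
A minimal sketch of grouping and aggregating, using a made-up `sales` table. Selecting the `amount` column before aggregating keeps non-numeric or unrelated columns out of the result.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi", "Delhi", "Mumbai"],
    "year": [2023, 2024, 2023, 2024, 2024],
    "amount": [100, 150, 200, 250, 300],
})

# One aggregate per group: total amount per city
total_by_city = sales.groupby("city")["amount"].sum()

# Several aggregates at once: one column per function
stats = sales.groupby("city")["amount"].agg(["mean", "max"])

print(total_by_city)
print(stats)
```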

61. **concat()**
```
pd.concat([df1, df2], axis=0)  # row-wise (default)
pd.concat([df1, df2], axis=1)  # column-wise
```
- Stack DataFrames vertically or horizontally
- Typical uses are combining columns from different DataFrames and appending rows to an existing DataFrame.

62. **merge()**
```
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, suffixes=('_x', '_y'))
left, right: The two DataFrames to merge.
on: Column name(s) to join on (must be present in both DataFrames).
left_on, right_on: Use when joining on columns with different names in each DataFrame.
how: Type of join: 'inner', 'outer', 'left', 'right'.
suffixes: Suffixes to apply to overlapping column names.
```
- It is used to combine two DataFrames based on common columns or indexes, similar to SQL joins.
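
To make the join types concrete, the sketch below merges two small, made-up tables whose key columns have different names, so `left_on` and `right_on` are needed. The table and column names are assumptions for the example.

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["ABC", "DEF", "XYZ"]})
salaries = pd.DataFrame({"id": [1, 2, 4], "salary": [50000, 60000, 80000]})

# inner: only keys present in both tables (1 and 2)
inner = pd.merge(employees, salaries, how="inner", left_on="emp_id", right_on="id")

# left: every row of employees; unmatched rows get NaN in the salary columns
left = pd.merge(employees, salaries, how="left", left_on="emp_id", right_on="id")

print(inner)
print(left)
```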

63. **join()**
```
df1.join(df2, how='left', on=None, lsuffix='', rsuffix='', sort=False)
other: The DataFrame to join with (the one to add columns from)
on: Column(s) in the calling DataFrame to join on, matched against the index of the
other DataFrame. If omitted, the join is index-on-index.
how: Type of join: 'left' (default), 'right', 'outer', or 'inner'
lsuffix: Suffix to add to overlapping column names in the calling DataFrame
rsuffix: Suffix to add to overlapping column names in the other DataFrame
sort: Whether to sort the join keys lexicographically (default False)
```
- It is used to combine two DataFrames based on their index (or optionally on a column). It’s similar to SQL joins but is optimized for index-based merging.

64. **append()**
```
df1.append(df2, ignore_index=False, verify_integrity=False, sort=False)
df2: DataFrame or Series to append to df1
ignore_index: If True, the resulting DataFrame will have a new integer index (default is False)
verify_integrity: If True, checks for duplicate indices and raises an error if found
sort: Sort columns if columns don’t match (default is False)
```
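- append() is used to add rows of one DataFrame to the end of another DataFrame.
- It’s similar to concatenating two DataFrames vertically.
- It does not modify the original DataFrame but returns a new DataFrame.
- append() was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions use pd.concat([df1, df2]) instead.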
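
On pandas 2.0 and later, where `DataFrame.append()` is no longer available, the same row-appending result can be obtained with `pd.concat()`. A minimal sketch with two made-up DataFrames:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["ABC", "DEF"], "Age": [25, 30]})
df2 = pd.DataFrame({"Name": ["XYZ"], "Age": [35]})

# ignore_index=True renumbers the rows 0, 1, 2 instead of keeping 0, 1, 0
combined = pd.concat([df1, df2], ignore_index=True)

print(combined)
```

Like `append()`, `pd.concat()` returns a new DataFrame and leaves `df1` and `df2` unchanged.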

65. **pivot()**
```
df.pivot(index='row_column', columns='column_column', values='value_column')
index: Column to use for row labels in the new table
columns: Column whose unique values become the new columns
values: Column to fill the cell values of the new table
```
- The pivot() function reshapes the DataFrame from long format to wide format by turning unique values from one column into columns.
- All combinations of index and columns must be unique.
  - If not, pivot() will raise a ValueError.
- To handle duplicates, use pivot_table() instead (it allows aggregation).

66. **melt()**
```
pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)
frame: The DataFrame to melt
id_vars: Columns to keep as identifier variables (stay in each row)
value_vars: Columns to unpivot (melt); if None, all other columns are used
var_name: Name for the 'variable' column (formerly column names)
value_name: Name for the 'value' column
ignore_index: If False, retains the original index
```
- The melt() function unpivots a DataFrame from wide format to long format, turning columns into rows.

- It is used when you want to normalize your data for analysis or visualization.
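
Since `pivot()` and `melt()` are roughly inverse operations, a round trip shows both. The sketch below uses a made-up long-format table of temperatures; the column names are assumptions for the example.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, city) pair
long_df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "temp": [21, 15, 22, 14],
})

# long -> wide: one row per date, one column per city
wide_df = long_df.pivot(index="date", columns="city", values="temp")

# wide -> long: reset_index() turns "date" back into a regular column first
back_to_long = wide_df.reset_index().melt(id_vars="date", var_name="city", value_name="temp")

print(wide_df)
print(back_to_long)
```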

#### **Binning (Discretizing values)**

**Binning is the process of grouping continuous numeric data into discrete intervals (or "bins"). It helps simplify data and uncover patterns by categorizing values — for example, grouping ages into "child", "adult", "senior".**

67. **pd.cut**
```
pd.cut(x, bins, labels=None, right=True, include_lowest=False)
```
- cut() is used to segment and sort numeric values into fixed intervals (equal width or custom-defined).

68. **pd.qcut()**

```
pd.qcut(x, q, labels=None, precision=3, duplicates='raise')
```
- qcut() is used to divide data into equal-sized quantile bins based on the distribution.
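
The contrast between `cut()` and `qcut()` can be sketched on the age example mentioned above. The ages, bin edges, and labels below are made up for illustration.

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 40, 62, 80])

# cut: bin edges are chosen by hand; the intervals are (0, 18], (18, 60], (60, 100]
groups = pd.cut(ages, bins=[0, 18, 60, 100], labels=["child", "adult", "senior"])

# qcut: edges are derived from the data so each bin holds about the same number of rows
halves = pd.qcut(ages, q=2, labels=["lower", "upper"])

print(groups.tolist())
print(halves.tolist())
```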

#### **Encoding**

Encoding is the process of converting categorical data (like strings or labels) into a numerical format that can be used in analysis or machine learning models.


69. **Label Encoding**

```
df['column_name'] = df['column_name'].astype('category').cat.codes
```
- Label Encoding assigns a unique integer to each category in a column.
  - Best for: Ordinal data (where order matters).

70. **One-Hot Encoding**  

```
pd.get_dummies(df, columns=['column_name'])
```

- One-Hot Encoding converts each category into a separate binary column (0 or 1; pandas 2.0+ returns True/False unless dtype=int is passed).
  - Best for: Nominal data (no natural order).
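
One caveat with label encoding through `cat.codes`: by default the categories are sorted, so the codes follow alphabetical order rather than any meaningful order. The sketch below, on a made-up `shirts` table, passes an explicit ordered category list for the ordinal column and uses one-hot encoding for the nominal one.

```python
import pandas as pd

shirts = pd.DataFrame({
    "size": ["M", "S", "L", "M"],
    "color": ["red", "blue", "red", "green"],
})

# Label encoding with an explicit order, so S < M < L maps to 0 < 1 < 2
size_type = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
shirts["size_code"] = shirts["size"].astype(size_type).cat.codes

# One-hot encoding for the nominal column: one new column per color
encoded = pd.get_dummies(shirts, columns=["color"])

print(encoded)
```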

# **String Methods in Pandas**

 71. **lower()**

 ```
 df['column'].str.lower()
 ```
 - Converts the text in the column to lower case.

 72. **upper()**

 ```
 df['column'].str.upper()
 ```
 - Converts the text in the column to upper case.

 73. **title()**

 ```
 df['column'].str.title()
 ```
 - Capitalizes first letter of each word

 74. **strip()**

 ```
  df['column'].str.strip()
 ```

 - Removes leading/trailing spaces

 75. **replace()**

 ```
 df['column'].str.replace(a,b)
 ```
 - Replaces substring

 76.  **contains()**

 ```
 df['column'].str.contains(x)
```
- It searches for a specified substring or regular expression within each string element of a Series.

77. **startswith('x')**

```
 df['column'].str.startswith('x')
```
- Checks if the string starts with the specified value, if yes it returns true.

78. **endswith('x')**

```
 df['column'].str.endswith('x')
```
- Checks if the string ends with the specified value, if yes it returns true.

79. **len()**

```
 df['column'].str.len()
```
- Returns length of each string.

80. **split()**

```
df['column'].str.split(' ')
```
- Splits string on the specified delimiter.

81. **get(n)**

```
df['column'].str.get(n)
```
- Extracts an element from a list or string-like structure at a specific index n.

82. **extract(r'regex')**

```
df['column'].str.extract(r'(regex_pattern)')
```
- Extracts substrings from a string column using a regular expression (regex), for example the domain name from an email address. The pattern must contain at least one capture group (parentheses), and only the first match in each row is returned. It's useful when you want to pull structured data out of unstructured text.
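
The email example mentioned above can be sketched as follows. The addresses are made up, and the pattern is a deliberately simple assumption (everything after the `@`), not a full email validator.

```python
import pandas as pd

emails = pd.Series(["abc@example.com", "def@test.org", "not-an-email"])

# The parentheses form the capture group; expand=False returns a Series
# instead of a one-column DataFrame. Rows without a match become NaN.
domains = emails.str.extract(r"@(.+)$", expand=False)

print(domains)
```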

83. **extractall()**

```
df['column'].str.extractall(r'(regex_pattern)')
```
- extractall() goes through each text row, looks for everything that matches your pattern, and gives back every match it finds, one row per match. As with extract(), the pattern needs at least one capture group.

84. **slice()**

```
df['column'].str.slice(start, stop)
```
- It is used to cut a portion (substring) from each string

85. **repeat(n)**

```
df['column'].str.repeat(n)
```

- Repeats each string n times.

# **Time Series Analysis**

Time series analysis is studying data that's recorded over time to find patterns, understand what happened, and predict what might happen next.

The main goal is often to forecast the future based on what the past data tells us.

Time Series analysis is very important in the financial domain.

86. **to_datetime()**

```
pd.to_datetime(arg, errors='raise', dayfirst=False, format=None, utc=False)
```
- Converts a wide variety of date/time formats into datetime64 format.

87. **date_range()**

```
pd.date_range(start=None, end=None, periods=None, freq='D', tz=None)
start: Start date (string, datetime, or Timestamp)
end: End date (same formats)
periods: Number of periods to generate
freq: Frequency (default is 'D' for daily)
tz: Time zone (optional)
```
- Creates a sequence of datetime values, which is very helpful when generating or simulating time series data.

#### **Common Frequencies**

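| Code                                         | Frequency   |
| -------------------------------------------- | ----------- |
| `'D'`                                        | Day         |
| `'h'` (`'H'` before pandas 2.2)              | Hour        |
| `'min'` (`'T'` is deprecated since 2.2)      | Minute      |
| `'s'` (`'S'` before pandas 2.2)              | Second      |
| `'ME'` (`'M'` before pandas 2.2)             | Month end   |
| `'MS'`                                       | Month start |
| `'W'`                                        | Week        |
| `'QE'` (`'Q'` before pandas 2.2)             | Quarter end |
| `'YE'` (`'A'` or `'Y'` before pandas 2.2)    | Year end    |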

88. **DatetimeIndex()**

```
pd.DatetimeIndex(data=None, freq=None, tz=None, name=None, closed=None, ambiguous='raise', dtype=None, copy=False)

data: A list, array, or Series of datetime-like objects (e.g., strings, datetime, or Timestamp)
freq: Optional frequency string (e.g., 'D', 'M', 'H')
tz: Time zone (e.g., 'UTC', 'America/New_York')
name: Optional name for the index
copy: Copy data (default False)
```
- DatetimeIndex converts a list or array of datetime-like values into an index object optimized for time-based operations.

89. **resample()**

```
df.resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, loffset=None, base=None, on=None, level=None, origin='start_day', offset=None)

rule: (Required) String representing the frequency (e.g. 'D', 'M', 'W', etc.)
axis: Axis to resample on (default is 0, meaning the index)
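closed: Which side of the bin is closed: 'left' or 'right'. The default is 'left', except for end-anchored frequencies (month end, quarter end, year end, week), where it is 'right'
label: Which bin edge to use as the label: 'left' or 'right', with the same defaults as closed
on: For resampling by a column instead of the index
level: Use a specific index level for resampling (if multi-index)
loffset: Offset to shift labels (deprecated in pandas 1.1 and removed in pandas 2.0, together with base)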
convention: For upsampling — 'start' or 'end' — fill values from start or end of the period
origin: Defines the timestamp for the start of the bins (e.g. 'epoch', 'start_day')
offset: Adjust the resampling bin edges (e.g., '1D' shifts by 1 day)
```
- Resampling is a process of converting a time series from one frequency to another:

  - **Downsampling:** Reducing frequency (e.g., daily → monthly)

  - **Upsampling:** Increasing frequency (e.g., monthly → daily)
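
Both directions can be sketched on a small synthetic series. The data below is made up (the integers 0 to 59 on 60 consecutive days), and `'MS'` (month start) is used as the monthly rule because it is spelled the same across pandas versions.

```python
import pandas as pd

# Hypothetical daily series covering January and February 2024 (31 + 29 = 60 days)
daily = pd.Series(range(60), index=pd.date_range("2024-01-01", periods=60, freq="D"))

# Downsampling: daily -> monthly; each bin needs an aggregation such as mean()
monthly_mean = daily.resample("MS").mean()

# Upsampling: monthly -> daily creates new rows, which need a fill rule such as ffill()
daily_again = monthly_mean.resample("D").ffill()

print(monthly_mean)
print(daily_again.head())
```

Upsampling cannot recover the original daily values; it only spreads the monthly figures over the new rows.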

90. **asfreq()**  

```
df.asfreq(freq, method=None, how=None, normalize=False, fill_value=None)

freq: The new frequency (e.g. 'D', 'M', 'H', etc.)
method: Optional fill method ('ffill' or 'bfill')
fill_value: Value to use for missing values
normalize: Whether to reset time to midnight (default False)
```
- It is used to change the frequency of a time series without aggregating or interpolating values. It's often used for upsampling or downsampling, where you want the data at the new frequency as-is, potentially filling missing values afterward.

91. **shift()**

```
df.shift(periods=1, freq=None, axis=0, fill_value=None)

periods: Number of periods to shift (positive = down, negative = up)
freq: Optional — shift index values by a date/time offset
axis: 0 (rows, default) or 1 (columns)
fill_value: Value to fill in for introduced missing data
```
- It is used to shift the values of a DataFrame or Series up or down along the index (usually a DatetimeIndex in time series). It's commonly used to create lag features for time series forecasting or analysis.

#### **What is moving window?**
A moving window (also called a rolling window) is a fixed-size subset of consecutive data points that "slides" through a time series to compute statistics like mean, sum, std, etc., at each step.

The window is called moving window because the window slides forward one data point at a time, recalculating the result each time.


92. **rolling()**

```
df.rolling(window, min_periods=None, center=False, win_type=None, axis=0, closed=None)

window: Size of the moving window (e.g., 3, '7D', etc.)
min_periods: Minimum observations in window to return a result
center: If True, center the window label
win_type: (Optional) Weighting method (e.g., 'triang', 'gaussian')
axis: Axis to apply (default is rows)
closed: Controls which sides of window are closed ('right', 'left', 'both', 'neither')
```

- It is used to perform rolling window operations, such as moving averages, moving sums, rolling standard deviations, etc., over time series data.

93. **expanding()**

```
df.expanding(min_periods=1, axis=0, method='single')

min_periods: Minimum number of observations needed to start calculating
axis: 0 (rows, default) or 1 (columns)
method: Internal optimization (leave as default)
```
- It is used to compute cumulative statistics (like cumulative mean, sum, etc.) from the start of a time series up to each point.
  - It includes all previous data points up to the current one.
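
The difference between a rolling and an expanding window shows up clearly on a short series. A minimal sketch with made-up values:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling: mean of the current point and the 2 before it; the first two results
# are NaN because fewer than 3 observations are available
rolling_mean = values.rolling(window=3).mean()

# Expanding: mean of every point from the start up to the current one
expanding_mean = values.expanding().mean()

print(rolling_mean.tolist())
print(expanding_mean.tolist())
```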

94. **interpolate()**

```
df.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None)

method: Interpolation technique (see below)
axis: 0 = interpolate down rows (default), 1 = across columns
limit: Maximum number of consecutive NaNs to fill
limit_direction: 'forward', 'backward', or 'both'
inplace: Modify the original DataFrame if True
```

| Method                   | Description                                                  |
| ------------------------ | ------------------------------------------------------------ |
| `'linear'`               | Default. Assumes values change linearly between points       |
| `'time'`                 | Use time index for interpolation (must be a `DatetimeIndex`) |
| `'polynomial'`           | Use a polynomial function (requires `order=` param)          |
| `'spline'`               | Spline interpolation (also needs `order`)                    |
| `'pad'` / `'ffill'`      | Fill with last known value                                   |
| `'backfill'` / `'bfill'` | Fill with next known value                                   |
| `'nearest'`              | Fill with the nearest known value                            |

- It is used in pandas to fill in missing values (NaNs) using a variety of interpolation methods. It’s especially useful in time series data where values change gradually over time (e.g., temperature, stock prices).
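
A minimal sketch of linear interpolation and the effect of `limit`, using a made-up temperature series with a two-day gap:

```python
import numpy as np
import pandas as pd

temps = pd.Series(
    [10.0, np.nan, np.nan, 16.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
)

# Linear: the gap is filled in equal steps between 10.0 and 16.0
linear = temps.interpolate(method="linear")

# limit=1: only the first NaN of the consecutive run is filled
limited = temps.interpolate(method="linear", limit=1)

print(linear.tolist())
print(limited.tolist())
```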

95. **tz_localize()**

```
df.tz_localize(tz, axis=0, level=None, copy=True, ambiguous='raise', nonexistent='raise')

tz: The time zone to assign (e.g., 'UTC', 'America/New_York')

axis: Axis to localize on (default 0, for rows)

level: Used for multi-index

copy: If False, modify the data in place

ambiguous: What to do with ambiguous times (e.g. 'NaT', 'infer')

nonexistent: What to do with nonexistent times during DST transition
```

- This method is used to assign a time zone to a datetime index that currently has no time zone info (naive).

96. **tz_convert()**

```
df.tz_convert(tz, axis=0, level=None, copy=True)

tz: The new time zone to convert to (must already have a time zone)

axis: Axis to apply on (default 0)

level: Used for multi-index with datetime

copy: If False, modify in place

```

- It is used after localization — it converts an already localized datetime from one time zone to another.
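
The two time zone methods are typically used one after the other. The sketch below starts from a made-up naive index, declares it to be UTC, and then views the same instants in another zone; the zone names are chosen only for the example.

```python
import pandas as pd

# Naive timestamps: no time zone attached yet
naive = pd.Series([1, 2], index=pd.to_datetime(["2024-01-01 09:00", "2024-01-01 18:00"]))

# tz_localize: attach a time zone; the clock times stay the same
utc = naive.tz_localize("UTC")

# tz_convert: same instants, displayed in another time zone (UTC+05:30)
kolkata = utc.tz_convert("Asia/Kolkata")

print(utc.index)
print(kolkata.index)
```

Calling `tz_convert()` directly on the naive series raises a `TypeError`, which is why localization comes first.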

97. **period_range()**

```
pd.period_range(start=None, end=None, periods=None, freq='D', name=None)

start: string or Period-like — start period

end: string or Period-like — end period

periods: int — number of periods to generate

freq: frequency alias (default 'D' for days)

name: optional name for the PeriodIndex
```

- It creates a fixed frequency sequence of Period objects — these represent time spans like months, quarters, years, or days, rather than specific timestamps.

98. **pct_change()**

```
df.pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)

periods: Number of periods to shift for forming the difference (default 1 means compare to previous row).

fill_method: How to fill missing values before computing (default 'pad').

limit: Maximum number of consecutive NaNs to fill.

freq: Frequency to conform before computing percentage change.
```

-  It computes the percentage change between the current and a prior element in a Series or DataFrame along a given axis.

99. **to_timestamp()**

```
PeriodIndex.to_timestamp(freq=None, how='start')

freq: Optional, output frequency alias (e.g., 'D', 'M'). Usually inferred.

how: 'start' (default) or 'end' — whether to convert to the period’s start or end timestamp.
```
- It is used to convert a PeriodIndex or Series of Periods to a DatetimeIndex or Series of Timestamps. It turns periods (time spans) into specific points in time (timestamps).

100. **diff()**

```
df.diff(periods=1, axis=0)

periods: Number of periods to shift for calculating difference (default is 1, i.e., current row minus previous row).

axis: For DataFrame, 0 computes difference row-wise (down the rows), 1 computes difference column-wise (across columns).
```

- It computes the difference between consecutive elements in a Series or DataFrame. It’s commonly used to calculate changes or deltas between rows.
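
`diff()` and `pct_change()` are closely related to `shift()`: both compare each row with the row one period earlier. A minimal sketch on a made-up price series:

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 121.0])

change = prices.diff()             # current value minus previous value
pct = prices.pct_change()          # (current - previous) / previous
manual = prices - prices.shift(1)  # the same as diff(), written out with shift()

print(change.tolist())
print(pct.round(4).tolist())
```

The first element of each result is NaN because the first row has no previous row to compare against.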

