# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#2: DataFrames in Depth`**

4. **Creating DataFrames**
    
    - From lists, dictionaries, and arrays
    - Reading data from CSV, Excel, and other formats
5. **Basic DataFrame Operations**
    
    - Inspecting the DataFrame
    - Indexing and selecting data
    - Descriptive statistics
6. **Data Cleaning and Handling Missing Data**
    
    - Handling missing values
    - Dropping or filling missing values
    - Removing duplicates

### **`5. Basic DataFrame Operations`**

#### **`Inspecting the DataFrame`**

**Introduction:**
Inspecting a DataFrame is an essential step in understanding its structure and contents. Pandas provides several methods and attributes that let you get a quick picture of the data. In this section, we'll explore the most common ones: `head()`, `tail()`, `info()`, `shape`, and `describe()`.

**Using `head()` and `tail()`:**

1. **`head(n)`:**
   - The `head()` method displays the first `n` rows of the DataFrame. It is useful for quickly getting an overview of the dataset.

In [2]:
import pandas as pd

df = pd.read_csv('data.csv')
df_head = df.head(5)  # Display the first 5 rows

print(df_head)

     Name  Age          City
0  Laxman   25          Pune
1  Rajesh   30     Hyderabad
2     Ram   22  Mahabubnagar
3   Ganga   32  Mahabubnagar
4  Jamuna   32          Pune


2. **`tail(n)`:**
   - The `tail()` method shows the last `n` rows of the DataFrame, allowing you to inspect the end of the dataset.

In [3]:
df_tail = df.tail(5)  # Display the last 5 rows

print(df_tail)

      Name  Age          City
4   Jamuna   32          Pune
5  Namrata   15          Pune
6   Varsha   16     Hyderabad
7   Vamshi   22     Hyderabad
8   Ananya   14  Mahabubnagar


**Using `info()`:**

1. **`info()`:**
   - The `info()` method provides a concise summary of the DataFrame, including the data types, non-null counts, and memory usage.

In [4]:
df.info()  # info() prints the summary itself and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    9 non-null      object
 1   Age     9 non-null      int64 
 2   City    9 non-null      object
dtypes: int64(1), object(2)
memory usage: 344.0+ bytes


**Using `shape`:**

1. **`shape`:**
   - The `shape` attribute returns a tuple representing the dimensions of the DataFrame (number of rows, number of columns).

In [5]:
df_shape = df.shape

print(df_shape)

(9, 3)


**Using `describe()`:**

1. **`describe()`:**
   - The `describe()` method generates descriptive statistics, including measures of central tendency, dispersion, and shape of the distribution.

In [6]:
df_describe = df.describe()

print(df_describe)

# Note: by default, describe() summarizes only the numeric columns,
# so Name and City do not appear in the output

             Age
count   9.000000
mean   23.111111
std     7.166667
min    14.000000
25%    16.000000
50%    22.000000
75%    30.000000
max    32.000000


2. **Customizing `describe()`:**
   - You can customize the output of `describe()` to include specific percentiles or types of statistics.

In [8]:
custom_describe = df.describe(percentiles=[0.386, 0.5, 0.618, 0.786], include='all')

print(custom_describe)

          Name        Age  City
count        9   9.000000     9
unique       9        NaN     3
top     Laxman        NaN  Pune
freq         1        NaN     3
mean       NaN  23.111111   NaN
std        NaN   7.166667   NaN
min        NaN  14.000000   NaN
38.6%      NaN  22.000000   NaN
50%        NaN  22.000000   NaN
61.8%      NaN  24.832000   NaN
78.6%      NaN  30.576000   NaN
max        NaN  32.000000   NaN


#### Explanation:

In the Pandas `describe()` method, the `include` parameter is used to specify the types of columns to be included in the summary statistics. It allows you to control whether to include only numeric columns, only object (string) columns, or include all columns regardless of their data types. The `include` parameter accepts different values:

- `'all'`: This includes all columns, regardless of their data types. Both numeric and non-numeric columns will be summarized.

- `'number'`: This includes only numeric columns in the summary. Non-numeric columns, such as strings or categorical data, will be excluded from the output.

- `'object'`: This includes only object (string) columns in the summary. Numeric columns will be excluded.

Here's an example to illustrate the usage of the `include` parameter:


In [9]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'numeric_col': [1, 2, 3, 4, 5],
    'string_col': ['apple', 'banana', 'orange', 'grape', 'kiwi']
}

df = pd.DataFrame(data)

# Using describe with different include values
all_columns_describe = df.describe(include='all')
numeric_columns_describe = df.describe(include='number')
object_columns_describe = df.describe(include='object')

print("Describe All Columns:")
print(all_columns_describe)

print("\nDescribe Numeric Columns Only:")
print(numeric_columns_describe)

print("\nDescribe Object (String) Columns Only:")
print(object_columns_describe)


Describe All Columns:
        numeric_col string_col
count      5.000000          5
unique          NaN          5
top             NaN      apple
freq            NaN          1
mean       3.000000        NaN
std        1.581139        NaN
min        1.000000        NaN
25%        2.000000        NaN
50%        3.000000        NaN
75%        4.000000        NaN
max        5.000000        NaN

Describe Numeric Columns Only:
       numeric_col
count     5.000000
mean      3.000000
std       1.581139
min       1.000000
25%       2.000000
50%       3.000000
75%       4.000000
max       5.000000

Describe Object (String) Columns Only:
       string_col
count           5
unique          5
top         apple
freq            1


**Conclusion:**
Inspecting a DataFrame is a crucial step in the data analysis process. Methods such as `head()`, `tail()`, `info()`, `shape`, and `describe()` provide valuable information about the structure, contents, and statistical summary of the dataset. Using these methods allows you to quickly assess the data and make informed decisions about further analysis.

#### Examples

In [12]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'Name': ['Laxman', 'Laxmikanth', 'Ashwanth', 'Ashok', 'Venky'],
    'Age': [25, 30, 35, 22, 28],
    'Salary': [50000, 60000, 75000, 48000, 55000],
    'Experience': [3, 5, 8, 2, 4],
}

df = pd.DataFrame(data)

# Displaying the DataFrame
print("Original DataFrame:")
print(df)

# Using head() and tail() for an overview
print("\nFirst 3 Rows (head()):")
print(df.head(3))

print("\nLast 2 Rows (tail()):")
print(df.tail(2))

# Using info() for a summary
print("\nDataFrame Info:")
df_info = df.info()  # info() prints the summary itself and returns None
print(df_info)       # so this line only prints "None"

# Using shape to get dimensions
df_shape = df.shape
print("\nDataFrame Shape:", df_shape)

# Using describe() for summary statistics
df_describe = df.describe()
print("\nSummary Statistics:\n", df_describe)

# Displaying the results
print("\nResults:")





Original DataFrame:
         Name  Age  Salary  Experience
0      Laxman   25   50000           3
1  Laxmikanth   30   60000           5
2    Ashwanth   35   75000           8
3       Ashok   22   48000           2
4       Venky   28   55000           4

First 3 Rows (head()):
         Name  Age  Salary  Experience
0      Laxman   25   50000           3
1  Laxmikanth   30   60000           5
2    Ashwanth   35   75000           8

Last 2 Rows (tail()):
    Name  Age  Salary  Experience
3  Ashok   22   48000           2
4  Venky   28   55000           4

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        5 non-null      object
 1   Age         5 non-null      int64 
 2   Salary      5 non-null      int64 
 3   Experience  5 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 288.0+ bytes
None

DataFrame Shape: (5, 4)

#### Real-world Scenario:
Consider a scenario where you have a dataset containing information about sales transactions for an e-commerce platform. You want to inspect the data to understand its structure, check for missing values, and get a quick overview of the sales performance.

In [13]:
import pandas as pd

# Sample e-commerce sales data
sales_data = {
    'OrderID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Camera'],
    'Quantity': [2, 1, 3, 2, 1],
    'Price': [1200, 800, 300, 150, 700],
    'CustomerID': [101, 102, 103, 104, 105],
    'Date': ['2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03', '2022-01-03'],
}

# Creating a DataFrame from the sales data
df_sales = pd.DataFrame(sales_data)

# Inspecting the DataFrame
print("Overview of Sales Data:")
print(df_sales.head())
print("\nStructure of Sales Data:")
print(df_sales.info())  # info() prints directly and returns None, hence the trailing "None" in the output
print("\nSummary Statistics of Sales Data:")
print(df_sales.describe())

Overview of Sales Data:
   OrderID     Product  Quantity  Price  CustomerID        Date
0      101      Laptop         2   1200         101  2022-01-01
1      102  Smartphone         1    800         102  2022-01-02
2      103      Tablet         3    300         103  2022-01-02
3      104  Headphones         2    150         104  2022-01-03
4      105      Camera         1    700         105  2022-01-03

Structure of Sales Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   OrderID     5 non-null      int64 
 1   Product     5 non-null      object
 2   Quantity    5 non-null      int64 
 3   Price       5 non-null      int64 
 4   CustomerID  5 non-null      int64 
 5   Date        5 non-null      object
dtypes: int64(4), object(2)
memory usage: 368.0+ bytes
None

Summary Statistics of Sales Data:
          OrderID  Quantity        Price  CustomerI

#### Considerations or Peculiarities:

- **Data Types:** Ensure that data types are appropriate for each column. Dates should be in datetime format, and numerical columns should have the correct data type.

- **Missing Values:** Check for missing values using methods like `isnull()` or `info()`. Decide on a strategy to handle missing data if needed.

- **Categorical Columns:** Identify and encode categorical columns appropriately. Some columns may have a finite set of categories, and using the `astype('category')` method can save memory.
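
The data type considerations above can be sketched on a small frame. This is a minimal illustration, assuming a hypothetical `df_orders` frame whose values are invented for the example; it checks the column types, converts text dates with `pd.to_datetime()`, and stores a low-cardinality column as a category.

```python
import pandas as pd

# Hypothetical data, invented for this sketch
df_orders = pd.DataFrame({
    'Date': ['2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03'],
    'City': ['Pune', 'Hyderabad', 'Pune', 'Pune'],
    'Price': [1200, 800, 300, 150],
})

# Dates read from text start out as a generic text column, not datetime
print(df_orders.dtypes)

# Convert the text dates to a proper datetime column
df_orders['Date'] = pd.to_datetime(df_orders['Date'])

# A column with few distinct values can be stored as a category
df_orders['City'] = df_orders['City'].astype('category')

print(df_orders.dtypes)
print(df_orders['City'].cat.categories.tolist())
```

Whether the category conversion actually saves memory depends on the number of rows and distinct values, so it is worth comparing `df.memory_usage(deep=True)` before and after on your own data.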

#### Common Mistakes:

- **Neglecting Missing Values:** Ignoring missing values during inspection can lead to incorrect analyses. Always check for missing data and decide how to handle it.

- **Not Understanding Data Types:** Misinterpreting data types may lead to errors in analysis. Make sure to understand the meaning and representation of each column's data type.

- **Overlooking Categorical Variables:** Categorical variables may not always be automatically identified. Check and convert categorical columns if needed, especially if they are nominal or ordinal.
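
As a rough sketch of the missing-value check mentioned above, the snippet below counts gaps per column and lists the affected rows. The frame `df_gaps` and its values are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two gaps, invented for this sketch
df_gaps = pd.DataFrame({
    'Name': ['Asha', 'Ravi', None, 'Meena'],
    'Age': [25, np.nan, 31, 40],
})

# Count missing entries per column
missing_per_column = df_gaps.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_gaps = df_gaps[df_gaps.isnull().any(axis=1)]
print(rows_with_gaps)
```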

Inspecting the DataFrame is a crucial step to understand the data's characteristics and make informed decisions during data analysis. Adapt the example code and considerations based on the specifics of your real-world datasets.

#### **`Indexing and Selecting Data in a DataFrame`**

**Introduction:**
Indexing and selecting data in a Pandas DataFrame are fundamental operations for extracting specific subsets of information. The two main tools for this purpose are `loc[]` (label-based) and `iloc[]` (position-based). In this section, we'll explore both and provide examples of conditional indexing and boolean indexing.

**Using `loc[]` for Label-Based Indexing:**

1. **Selecting Rows by Label:**
   - Use `loc[]` to select rows based on their labels (index values):

In [1]:
import pandas as pd

# Read the CSV file without setting index
df = pd.read_csv('data.csv')

# Display the DataFrame to inspect its structure
print(df)

# Optional: if the file contained an 'ID' column, it could be used as the index
# df = pd.read_csv('data.csv', index_col='ID')

# Select the row whose index label is 2
selected_row = df.loc[2]
print(selected_row)


      Name  Age          City
0   Laxman   25          Pune
1   Rajesh   30     Hyderabad
2      Ram   22  Mahabubnagar
3    Ganga   32  Mahabubnagar
4   Jamuna   32          Pune
5  Namrata   15          Pune
6   Varsha   16     Hyderabad
7   Vamshi   22     Hyderabad
8   Ananya   14  Mahabubnagar
Name             Ram
Age               22
City    Mahabubnagar
Name: 2, dtype: object


2. **Selecting Specific Columns for a Row:**
   - Specify both row label and column label to select a specific value:

In [2]:
specific_value = df.loc[2, 'Name']  # Select 'Name' for row with index 2
print(specific_value)

Ram


3. **Slicing Rows:**
   - Use slicing with labels to select a range of rows:

In [3]:
sliced_rows = df.loc[2:5]  # Select rows with indices 2 to 5 (inclusive)
print(sliced_rows)

      Name  Age          City
2      Ram   22  Mahabubnagar
3    Ganga   32  Mahabubnagar
4   Jamuna   32          Pune
5  Namrata   15          Pune


#### Explanation for why the stop is inclusive when slicing with loc[]

In the code snippet below:

```python
sliced_rows = df.loc[2:5]  # Select rows with indices 2 to 5 (inclusive)
print(sliced_rows)
```

- `df`: A Pandas DataFrame with the default integer index (0, 1, 2, ...).
- `loc`: The label-based indexer. It is an attribute of the DataFrame, not a function you call, and it is used with square brackets to access rows and columns by label or by boolean array. Here it selects the rows whose index labels run from 2 to 5, inclusive.

The general pattern is `df.loc[row_indexer, column_indexer]`, where `row_indexer` selects rows by label and the optional `column_indexer` selects columns by label.

Standard Python slicing excludes the stop value, but a label slice with `loc` includes both the start and the stop. So `df.loc[2:5]` selects the rows labelled 2, 3, 4, and 5.

With `iloc`, which slices by integer position, the stop is exclusive, following the usual Python convention. The difference is practical: labels need not be consecutive integers (they can be strings or dates), so there is often no obvious "next label" to use as an exclusive stop. Including the stop label lets you name the last row you want directly.
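
The difference is easiest to see side by side. The sketch below uses a hypothetical frame `df_labels` with string labels, invented for this illustration, and then repeats the comparison on a default integer index.

```python
import pandas as pd

# Hypothetical frame with string labels, invented for this sketch
df_labels = pd.DataFrame(
    {'Age': [25, 30, 22, 32]},
    index=['a', 'b', 'c', 'd'],
)

by_label = df_labels.loc['b':'c']   # label slice: both 'b' and 'c' are included
by_position = df_labels.iloc[1:3]   # position slice: stop position 3 is excluded

print(by_label)
print(by_position)

# With the default integer index the same slice text gives different row counts
df_default = df_labels.reset_index(drop=True)
rows_loc = len(df_default.loc[1:3])    # labels 1, 2, 3
rows_iloc = len(df_default.iloc[1:3])  # positions 1, 2
print(rows_loc, rows_iloc)
```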

#### Is `loc[]` a function, a constant, or an attribute?

`loc` is an attribute of the DataFrame; more precisely, it is a property. Accessing `df.loc` returns an indexer object, and the square brackets that follow are handed to that object, which looks up the requested labels. This is the same mechanism as `my_list[0]` or `my_dict['key']`: square brackets index into an object, while parentheses call a function or method.

Here's a brief explanation:

- **Indexing (square brackets)**: `obj[key]` asks the object for the item or items stored under `key`.

- **Calling (parentheses)**: `obj(...)` invokes a function or method and runs its code.

In the case of Pandas' `loc`, consider the following:

- **Indexing with `loc` (correct)**:
  ```python
  selected_rows = df.loc[2:5]  # df.loc returns the indexer; [2:5] selects labels 2 to 5
  ```

- **Adding call parentheses (incorrect)**:
  ```python
  selected_rows = df.loc[2:5]()  # TypeError: df.loc[2:5] is a DataFrame, which is not callable
  ```

So `df.loc[2:5]` selects rows by label through the indexer's square-bracket syntax. No parentheses are needed because the selection is expressed as an indexing operation rather than a function call.
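
A short sketch can make this concrete. It assumes a hypothetical frame `df_demo` and shows that square brackets on `df_demo.loc` are shorthand for the indexer's `__getitem__` method. The printed class name is an internal pandas detail and may differ between versions.

```python
import pandas as pd

# Hypothetical frame, invented for this sketch
df_demo = pd.DataFrame({'Age': [25, 30, 22, 32, 32, 15]})

# On the class, loc is defined as a property
print(isinstance(pd.DataFrame.loc, property))

# On an instance, the property returns an indexer object
indexer = df_demo.loc
print(type(indexer).__name__)

# Square brackets are shorthand for calling the indexer's __getitem__
with_brackets = df_demo.loc[2:5]
with_getitem = indexer.__getitem__(slice(2, 5))
print(with_brackets.equals(with_getitem))
```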

4. **Selecting Rows and Columns Simultaneously:**
   - Use `loc[]` to select specific rows and columns:

In [4]:
selected_data = df.loc[2:5, ['Name', 'Age']]
print(selected_data)

      Name  Age
2      Ram   22
3    Ganga   32
4   Jamuna   32
5  Namrata   15


In [5]:
selected_data = df.loc[2:5, ['Age', 'Name']]
print(selected_data)

# Note: columns can be listed in any order; the result follows the order you give

   Age     Name
2   22      Ram
3   32    Ganga
4   32   Jamuna
5   15  Namrata


In [7]:
selected_data = df.loc[[3,2,5], ['Age', 'Name']]
print(selected_data)

# Note: passing a list of labels selects just those rows, in the order given

   Age     Name
3   32    Ganga
2   22      Ram
5   15  Namrata


**Using `iloc[]` for Position-Based Indexing:**

1. **Selecting Rows by Position:**
   - Use `iloc[]` to select rows based on their integer positions:

In [28]:
print(df)
print("-------------")
selected_row_position = df.iloc[1]  # Select the second row (position 1)
print(selected_row_position)

      Name  Age          City
0   Laxman   25          Pune
1   Rajesh   30     Hyderabad
2      Ram   22  Mahabubnagar
3    Ganga   32  Mahabubnagar
4   Jamuna   32          Pune
5  Namrata   15          Pune
6   Varsha   16     Hyderabad
7   Vamshi   22     Hyderabad
8   Ananya   14  Mahabubnagar
-------------
Name       Rajesh
Age            30
City    Hyderabad
Name: 1, dtype: object


2. **Selecting Specific Columns for a Row by Position:**
   - Specify both row position and column position to select a specific value:

In [25]:
specific_value_position = df.iloc[1, 0]  # Select the first column for the second row
print(specific_value_position)

Rajesh


3. **Slicing Rows by Position:**
   - Use slicing with integer positions to select a range of rows:

In [26]:
sliced_rows_position = df.iloc[1:4]  # Select rows at positions 1 to 3 (stop position 4 is excluded)
print(sliced_rows_position)

     Name  Age          City
1  Rajesh   30     Hyderabad
2     Ram   22  Mahabubnagar
3   Ganga   32  Mahabubnagar


4. **Selecting Rows and Columns Simultaneously by Position:**
   - Use `iloc[]` to select specific rows and columns by position:

In [29]:
selected_data_position = df.iloc[1:4, [0, 1]]
print(selected_data_position)

     Name  Age
1  Rajesh   30
2     Ram   22
3   Ganga   32
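
In the examples above the index labels happen to equal the row positions, so `loc[]` and `iloc[]` look interchangeable. The following sketch, which assumes a hypothetical frame `df_ids` with labels 10, 20, and 30, shows where they diverge.

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2, ...
df_ids = pd.DataFrame(
    {'Name': ['Laxman', 'Rajesh', 'Ram']},
    index=[10, 20, 30],
)

print(df_ids.loc[10, 'Name'])  # row labelled 10
print(df_ids.iloc[0, 0])       # row at position 0 (the same row)
print(df_ids.iloc[2, 0])       # row at position 2

# Label 0 does not exist here, so loc raises KeyError
try:
    df_ids.loc[0]
except KeyError:
    print("KeyError: no row is labelled 0")
```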


**Conditional Indexing and Boolean Indexing:**

1. **Conditional Indexing:**
   - Use boolean conditions to filter rows based on a specific criterion:

In [30]:
condition = df['Age'] > 30
conditionally_selected = df.loc[condition]
print(conditionally_selected)

     Name  Age          City
3   Ganga   32  Mahabubnagar
4  Jamuna   32          Pune


2. **Boolean Indexing:**
   - Use boolean arrays directly for filtering:

In [31]:
boolean_selected = df[df['Age'] > 30]
print(boolean_selected)

     Name  Age          City
3   Ganga   32  Mahabubnagar
4  Jamuna   32          Pune
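
Conditions can also be combined. The sketch below rebuilds the rows printed above inline, under the hypothetical name `df_people`, so that it runs without `data.csv`. It uses `&` (and), `|` (or), `~` (not), and `isin()`; each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`.

```python
import pandas as pd

# Same rows as the data.csv output shown above, rebuilt inline
df_people = pd.DataFrame({
    'Name': ['Laxman', 'Rajesh', 'Ram', 'Ganga', 'Jamuna',
             'Namrata', 'Varsha', 'Vamshi', 'Ananya'],
    'Age': [25, 30, 22, 32, 32, 15, 16, 22, 14],
    'City': ['Pune', 'Hyderabad', 'Mahabubnagar', 'Mahabubnagar', 'Pune',
             'Pune', 'Hyderabad', 'Hyderabad', 'Mahabubnagar'],
})

# AND: each condition needs its own parentheses
adults_in_pune = df_people[(df_people['Age'] >= 18) & (df_people['City'] == 'Pune')]

# OR combined with NOT (~)
young_or_not_pune = df_people[(df_people['Age'] < 16) | ~(df_people['City'] == 'Pune')]

# isin() tests membership in a list of values
two_cities = df_people[df_people['City'].isin(['Pune', 'Hyderabad'])]

print(adults_in_pune)
print(len(young_or_not_pune), len(two_cities))
```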


**Conclusion:**
Indexing and selecting data in a Pandas DataFrame using `loc[]` and `iloc[]` are powerful techniques. These methods allow you to retrieve specific rows and columns based on labels or positions. Additionally, conditional indexing and boolean indexing enable you to filter data efficiently based on specific criteria.

#### Example:


In [33]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'Name': ['Laxman', 'Rajesh', 'Chanakya', 'Dravid', 'Emanuel'],
    'Age': [25, 30, 35, 22, 28],
    'Salary': [50000, 60000, 75000, 48000, 55000],
    'Experience': [3, 5, 8, 2, 4],
}

df = pd.DataFrame(data)
df.set_index('Name', inplace=True)  # Set 'Name' column as the index

# Using loc[] for Label-Based Indexing
selected_row = df.loc['Rajesh']
specific_value = df.loc['Rajesh', 'Age']
sliced_rows = df.loc['Rajesh':'Dravid']
selected_data = df.loc[['Rajesh', 'Dravid'], ['Age', 'Salary']]

# Using iloc[] for Position-Based Indexing
selected_row_position = df.iloc[1]
specific_value_position = df.iloc[1, 0]
sliced_rows_position = df.iloc[1:4]
selected_data_position = df.iloc[1:4, [0, 1]]

# Conditional Indexing and Boolean Indexing
condition = df['Age'] > 30
conditionally_selected = df.loc[condition]
boolean_selected = df[df['Age'] > 30]

# Displaying the results
print("Using loc[] for Label-Based Indexing:")
print("Selected Row:\n", selected_row)
print("Specific Value:\n", specific_value)
print("Sliced Rows:\n", sliced_rows)
print("Selected Data:\n", selected_data)

print("\nUsing iloc[] for Position-Based Indexing:")
print("Selected Row by Position:\n", selected_row_position)
print("Specific Value by Position:\n", specific_value_position)
print("Sliced Rows by Position:\n", sliced_rows_position)
print("Selected Data by Position:\n", selected_data_position)

print("\nConditional Indexing and Boolean Indexing:")
print("Conditionally Selected:\n", conditionally_selected)
print("Boolean Selected:\n", boolean_selected)


Using loc[] for Label-Based Indexing:
Selected Row:
 Age              30
Salary        60000
Experience        5
Name: Rajesh, dtype: int64
Specific Value:
 30
Sliced Rows:
           Age  Salary  Experience
Name                             
Rajesh     30   60000           5
Chanakya   35   75000           8
Dravid     22   48000           2
Selected Data:
         Age  Salary
Name               
Rajesh   30   60000
Dravid   22   48000

Using iloc[] for Position-Based Indexing:
Selected Row by Position:
 Age              30
Salary        60000
Experience        5
Name: Rajesh, dtype: int64
Specific Value by Position:
 30
Sliced Rows by Position:
           Age  Salary  Experience
Name                             
Rajesh     30   60000           5
Chanakya   35   75000           8
Dravid     22   48000           2
Selected Data by Position:
           Age  Salary
Name                 
Rajesh     30   60000
Chanakya   35   75000
Dravid     22   48000

Conditional Indexing and Boolean Ind

#### Real-world Scenario:
Imagine you have a dataset containing information about employees in a company, including their ID, name, department, salary, and performance ratings. You want to perform various operations to analyze and extract specific information about employees.

In [9]:
import pandas as pd

# Sample employee data
employee_data = {
    'EmployeeID': [101, 102, 103, 104, 105],
    'Name': ['Radhika', 'Shanaya', 'Khushi', 'Bhanu', 'Swapna'],
    'Department': ['HR', 'IT', 'Sales', 'Finance', 'Marketing'],
    'Salary': [60000, 75000, 80000, 90000, 70000],
    'PerformanceRating': [8.5, 9.2, 7.8, 8.9, 9.5],
}

# Creating a DataFrame from the employee data
df_employees = pd.DataFrame(employee_data)

# Indexing and selecting data
selected_employee = df_employees.loc[df_employees['Name'] == 'Khushi']
high_performance_employees = df_employees[df_employees['PerformanceRating'] > 9.0]
selected_columns = df_employees.loc[:, ['Name', 'Department', 'Salary']]

# Displaying the selected data
print("Selected Employee (Khushi):\n", selected_employee)
print("\nHigh-Performance Employees:\n", high_performance_employees)
print("\nSelected Columns:\n", selected_columns)


Selected Employee (Khushi):
    EmployeeID    Name Department  Salary  PerformanceRating
2         103  Khushi      Sales   80000                7.8

High-Performance Employees:
    EmployeeID     Name Department  Salary  PerformanceRating
1         102  Shanaya         IT   75000                9.2
4         105   Swapna  Marketing   70000                9.5

Selected Columns:
       Name Department  Salary
0  Radhika         HR   60000
1  Shanaya         IT   75000
2   Khushi      Sales   80000
3    Bhanu    Finance   90000
4   Swapna  Marketing   70000


#### Considerations or Peculiarities:

- **Indexing Choice:** Choose an appropriate column as the index based on your analysis needs. It could be a unique identifier like employee ID or another column that is relevant to your analysis.

- **Boolean Indexing:** Understand how to use boolean indexing effectively. It allows you to filter data based on conditions, as shown in the example with high-performance employees.

- **Column Selection:** Be mindful of the columns you select. If you only need specific columns, it's more efficient to select those rather than the entire DataFrame.
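
As a minimal sketch of the indexing choice described above, the snippet below takes three rows of the employee data, uses `EmployeeID` as the index, and looks up one employee by ID. The names `df_emp` and `df_by_id` are introduced only for this illustration.

```python
import pandas as pd

# Subset of the employee data used in the example above
df_emp = pd.DataFrame({
    'EmployeeID': [101, 102, 103],
    'Name': ['Radhika', 'Shanaya', 'Khushi'],
    'Salary': [60000, 75000, 80000],
})

# Use the unique identifier as the index (set_index returns a new frame)
df_by_id = df_emp.set_index('EmployeeID')

# Label lookup by ID, restricted to the columns that are needed
print(df_by_id.loc[103, ['Name', 'Salary']])

# A quick check that the chosen index really is unique
print(df_by_id.index.is_unique)
```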

#### Common Mistakes:

- **Incorrect Syntax:** Incorrect use of square brackets, parentheses, or quotation marks in the indexing conditions can lead to errors. Always double-check syntax.

- **Using `==` for Float Comparison:** When comparing float values, be cautious due to potential precision issues. Using methods like `np.isclose()` is recommended for float comparisons.

- **Misunderstanding Boolean Indexing:** Developers may mistakenly think that boolean indexing is limited to exact matches, but it can be used for various conditions.
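
The float comparison pitfall can be reproduced in a few lines. The frame `df_float` below is hypothetical; it contains `0.1 + 0.2`, which is not stored as exactly `0.3`.

```python
import numpy as np
import pandas as pd

# Hypothetical values, invented for this sketch
df_float = pd.DataFrame({'value': [0.1 + 0.2, 0.3, 0.5]})

# Exact equality misses the first row because 0.1 + 0.2 != 0.3 in floating point
exact_match = df_float[df_float['value'] == 0.3]

# np.isclose() compares within a small tolerance and catches both rows
close_match = df_float[np.isclose(df_float['value'], 0.3)]

print(len(exact_match))
print(len(close_match))
```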

Indexing and selecting data are crucial skills for extracting relevant information from a DataFrame. Adjust the example code and considerations based on the specifics of your real-world scenarios and datasets.



#### **`Descriptive Statistics in Pandas`**

**Introduction:**
Descriptive statistics aim to summarize and describe the main features of a dataset. Pandas provides various functions to compute descriptive statistics for each column in a DataFrame.

**1. Mean:**
   - **Definition:** The mean, also known as the average, is the sum of all values in a dataset divided by the number of observations.
   - **Pandas Code:**
     ```python
     mean_values = df.mean()
     ```
   - **Interpretation:** The mean provides a measure of central tendency, indicating the typical value in a dataset.

**2. Median:**
   - **Definition:** The median is the middle value in a dataset when it is sorted in ascending order. It is less sensitive to extreme values than the mean.
   - **Pandas Code:**
     ```python
     median_values = df.median()
     ```
   - **Interpretation:** The median gives insight into the central position of the data, especially in the presence of outliers.

**3. Mode:**
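   - **Definition:** The mode is the most frequently occurring value in a dataset. A column can have several modes when values tie for the highest count.
   - **Pandas Code:**
     ```python
     mode_values = df.mode().iloc[0]  # first mode of each column
     ```
   - **Interpretation:** The mode shows the most common value in each column. `df.mode()` returns every tied value in ascending order, so `.iloc[0]` keeps only the smallest one. When all values in a column are distinct, every value is a mode and the result is simply the minimum.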

**4. Standard Deviation:**
   - **Definition:** The standard deviation measures the amount of variation or dispersion in a set of values. A higher standard deviation indicates greater variability.
   - **Pandas Code:**
     ```python
     std_deviation = df.std()
     ```
   - **Interpretation:** Standard deviation is crucial for assessing the spread of values around the mean.

**5. Variance:**
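   - **Definition:** Variance is the average of the squared differences from the mean, and it equals the square of the standard deviation. Pandas' `var()` and `std()` compute the sample versions by default (`ddof=1`), dividing by `n - 1` rather than `n`.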
   - **Pandas Code:**
     ```python
     variance_values = df.var()
     ```
   - **Interpretation:** Variance provides another measure of data dispersion, useful in comparing the spread of different datasets.
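
To make the divisor explicit, the sketch below computes the variance of the Age values used in the example later in this section, once with the pandas default and once with `ddof=0`, and checks the default against a manual calculation.

```python
import pandas as pd

# Age values from the example later in this section
ages = pd.Series([25, 30, 35, 22, 28])

sample_var = ages.var()            # ddof=1 by default: divide by n - 1
population_var = ages.var(ddof=0)  # divide by n

# Manual sample variance for comparison
manual_sample = ((ages - ages.mean()) ** 2).sum() / (len(ages) - 1)

print(sample_var, population_var, manual_sample)
```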

**6. Quantiles and Percentiles:**
   - **Definition:** Quantiles divide a dataset into intervals with equal probabilities. Percentiles are specific quantiles expressed as percentages.
   - **Pandas Code:**
     ```python
     quantiles = df.quantile([0.25, 0.5, 0.75])
     ```
   - **Interpretation:** Quantiles help in understanding the distribution and identifying central points in the data.

**7. Interquartile Range (IQR):**
   - **Definition:** IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile). It provides a measure of statistical dispersion.
   - **Pandas Code:**
     ```python
     iqr_values = quantiles.loc[0.75] - quantiles.loc[0.25]
     ```
   - **Interpretation:** IQR is useful for identifying potential outliers and understanding the bulk of the data distribution.
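
One common convention for flagging potential outliers, assumed here rather than prescribed by Pandas, is to mark values more than 1.5 times the IQR below the first quartile or above the third. The salary figures are hypothetical and invented for this sketch.

```python
import pandas as pd

# Hypothetical salaries with one extreme value, invented for this sketch
salaries = pd.Series([48000, 50000, 55000, 60000, 75000, 250000])

q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (a convention, not a pandas default)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]
print(iqr)
print(outliers.tolist())
```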

**8. Skewness:**
   - **Definition:** Skewness measures the asymmetry of a distribution. Positive skewness indicates a right-skewed distribution, while negative skewness indicates a left-skewed distribution.
   - **Pandas Code:**
     ```python
     skewness_values = df.skew()
     ```
   - **Interpretation:** Skewness provides insights into the shape of the distribution.

**9. Kurtosis:**
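   - **Definition:** Kurtosis measures how heavy the tails of a distribution are compared with a normal distribution. Pandas' `kurt()` reports excess kurtosis (Fisher's definition), so a normal distribution scores 0 and positive values indicate heavier tails and more extreme values.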
   - **Pandas Code:**
     ```python
     kurtosis_values = df.kurt()
     ```
   - **Interpretation:** Kurtosis helps in understanding the tails' thickness and the presence of outliers.

**10. Correlation and Covariance:**
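   - **Definition:** Correlation measures the strength and direction of the linear relationship between two variables on a scale from -1 to 1 (`corr()` uses Pearson's coefficient by default), while covariance measures their joint variability in the variables' original units.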
   - **Pandas Code:**
     ```python
     correlation_matrix = df.corr()
     covariance_matrix = df.cov()
     ```
   - **Interpretation:** Correlation and covariance are crucial for understanding relationships between variables.
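
The link between the two measures can be checked directly: Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below uses the Age and Experience values from the example later in this section; the name `df_pair` is introduced only for this illustration.

```python
import pandas as pd

# Age and Experience values from the example later in this section
df_pair = pd.DataFrame({
    'Age': [25, 30, 35, 22, 28],
    'Experience': [3, 5, 8, 2, 4],
})

cov_xy = df_pair['Age'].cov(df_pair['Experience'])
corr_xy = df_pair['Age'].corr(df_pair['Experience'])  # Pearson by default

# Pearson correlation is covariance rescaled by both standard deviations
rescaled = cov_xy / (df_pair['Age'].std() * df_pair['Experience'].std())

print(round(corr_xy, 6), round(rescaled, 6))
```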

**Conclusion:**
Descriptive statistics in Pandas provide a comprehensive view of the distribution, relationships, and variability within a dataset. Understanding these measures is fundamental for data analysis and decision-making. The choice of which statistics to use depends on the nature of the data and the questions you want to answer.
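
The snippets above assume that every column in `df` is numeric. When a frame also holds text columns, recent pandas versions are expected to raise an error for calls such as `df.mean()` instead of silently skipping those columns, so it is safer to pass `numeric_only=True`. The sketch below assumes a pandas version in which `corr()` accepts `numeric_only` (1.5 or later) and uses a hypothetical mixed-type frame `df_mixed`.

```python
import pandas as pd

# Hypothetical mixed-type frame, invented for this sketch
df_mixed = pd.DataFrame({
    'Name': ['Laxman', 'Rajesh', 'Ram'],
    'Age': [25, 30, 22],
    'Salary': [50000, 60000, 48000],
})

# Restrict the calculations to numeric columns explicitly
numeric_means = df_mixed.mean(numeric_only=True)
numeric_corr = df_mixed.corr(numeric_only=True)

print(numeric_means)
print(numeric_corr)
```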

#### Example :


In [10]:
import pandas as pd
import numpy as np

# Creating a sample DataFrame
data = {
    'Age': [25, 30, 35, 22, 28],
    'Salary': [50000, 60000, 75000, 48000, 55000],
    'Experience': [3, 5, 8, 2, 4],
}

df = pd.DataFrame(data)

# Mean, Median, and Mode
mean_values = df.mean()
median_values = df.median()
mode_values = df.mode().iloc[0]

# Measures of Dispersion
std_deviation = df.std()
variance_values = df.var()

# Quantiles and Percentiles
quantiles = df.quantile([0.25, 0.5, 0.75])
iqr_values = quantiles.loc[0.75] - quantiles.loc[0.25]

# Summary Statistics
summary_stats = df.describe()

# Skewness and Kurtosis
skewness_values = df.skew()
kurtosis_values = df.kurt()

# Correlation and Covariance
correlation_matrix = df.corr()
covariance_matrix = df.cov()

# Displaying the results
print("Mean Values:\n", mean_values)
print("\nMedian Values:\n", median_values)
print("\nMode Values:\n", mode_values)
print("\nStandard Deviation:\n", std_deviation)
print("\nVariance Values:\n", variance_values)
print("\nQuantiles:\n", quantiles)
print("\nInterquartile Range (IQR):\n", iqr_values)
print("\nSummary Statistics:\n", summary_stats)
print("\nSkewness Values:\n", skewness_values)
print("\nKurtosis Values:\n", kurtosis_values)
print("\nCorrelation Matrix:\n", correlation_matrix)
print("\nCovariance Matrix:\n", covariance_matrix)


Mean Values:
 Age              28.0
Salary        57600.0
Experience        4.4
dtype: float64

Median Values:
 Age              28.0
Salary        55000.0
Experience        4.0
dtype: float64

Mode Values:
 Age              22
Salary        48000
Experience        2
Name: 0, dtype: int64

Standard Deviation:
 Age               4.949747
Salary        10784.247772
Experience        2.302173
dtype: float64

Variance Values:
 Age                  24.5
Salary        116300000.0
Experience            5.3
dtype: float64

Quantiles:
        Age   Salary  Experience
0.25  25.0  50000.0         3.0
0.50  28.0  55000.0         4.0
0.75  30.0  60000.0         5.0

Interquartile Range (IQR):
 Age               5.0
Salary        10000.0
Experience        2.0
dtype: float64

Summary Statistics:
              Age        Salary  Experience
count   5.000000      5.000000    5.000000
mean   28.000000  57600.000000    4.400000
std     4.949747  10784.247772    2.302173
min    22.000000  48000.000000    2

#### Real-world Scenario:
Consider a scenario where you have a dataset containing information about the performance of students in an educational institution. The dataset includes student IDs, exam scores in different subjects, attendance percentages, and participation in extracurricular activities. You want to extract descriptive statistics to gain insights into the students' academic performance.

In [11]:
import pandas as pd

# Sample student performance data
student_data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Math_Score': [85, 90, 78, 92, 88],
    'English_Score': [75, 85, 80, 88, 92],
    'Attendance_Percentage': [92, 95, 88, 97, 93],
    'Extracurricular_Participation': [2, 3, 1, 4, 2],
}

# Creating a DataFrame from the student data
df_students = pd.DataFrame(student_data)

# Extracting Descriptive Statistics
# Note: StudentID is an identifier, so its mean, median, and std in the output are not meaningful
mean_scores = df_students.mean()
median_scores = df_students.median()
std_deviation_scores = df_students.std()
attendance_summary = df_students['Attendance_Percentage'].describe()
correlation_matrix = df_students.corr()

# Displaying the Descriptive Statistics
print("Mean Scores:\n", mean_scores)
print("\nMedian Scores:\n", median_scores)
print("\nStandard Deviation of Scores:\n", std_deviation_scores)
print("\nAttendance Summary:\n", attendance_summary)
print("\nCorrelation Matrix:\n", correlation_matrix)


Mean Scores:
 StudentID                         3.0
Math_Score                       86.6
English_Score                    84.0
Attendance_Percentage            93.0
Extracurricular_Participation     2.4
dtype: float64

Median Scores:
 StudentID                         3.0
Math_Score                       88.0
English_Score                    85.0
Attendance_Percentage            93.0
Extracurricular_Participation     2.0
dtype: float64

Standard Deviation of Scores:
 StudentID                        1.581139
Math_Score                       5.458938
English_Score                    6.670832
Attendance_Percentage            3.391165
Extracurricular_Participation    1.140175
dtype: float64

Attendance Summary:
 count     5.000000
mean     93.000000
std       3.391165
min      88.000000
25%      92.000000
50%      93.000000
75%      95.000000
max      97.000000
Name: Attendance_Percentage, dtype: float64

Correlation Matrix:
                                StudentID  Math_Score  English_